The present research was conducted in the area of aural speaker
identification.     The  purposes  of  this    investigation  were   (1)   to 
determine  whether  (a)    listeners  who  did  and   (b)    listeners  who  did  not 
know  the  speakers   performed  equally  well  on  the  task  of  speaker   identi- 
fication— and/or  differed   in  their  types  of  responses;    (2)   to  determine 
whether  listener  performance  was  the   same   for  all  vowels;   (3)   to  deter- 
mine  the  effect  of  controlled  fundamental  frequencies   (fo)    on  speaker 
identification;    (4)   to  determine  the   relationship  between  speaking  funda- 
mental frequency  (SFF)   and  speaker  identification — and  SFF  and  speaker 
confusions;    and  (5)   to  determine  the  relationship  between  formant 
frequencies   and  speaker  confusions. 

The  subjects  for  this   study  were  six  male   speakers   and  two  groups  of 
listeners,  eight  who  knew  the   speakers   and  eight  who  did  not.     The 
listeners were presented with a training session consisting of a passage
read  by  the   speakers.      Following  the   training  session  they  attempted  to 
identify  the  speakers   from  recorded  stimuli  consisting  of   (a)   sentences 
and   (b)    four  vowels   at  each  of  four  fundamental  frequencies. 

The   sentence   stimuli  resulted   in  the  highest  speaker   identification 
performance  for  all  listeners.     The   following  results  were   noted  for  the 
vowel  stimuli.     (1)  The   listeners  who  knew  the  speakers   (Group  I) 
performed  significantly  above  chance  while   the   listeners  who  did  not  know 
the   speakers   (Group  II)   did  not  perform  above   chance.     Further,  Group  I 
performed significantly better than Group II. (2) Low vowels resulted in
significantly  better  performance   than  high  vowels  for  Group  I.     (3)  The 
two lower fundamental frequencies yielded significantly higher performance
than the two higher fundamental frequencies for Group I.
(4)  Speakers  whose  SFFs  deviated  most  from  the   group  mean  SFF  were  signi- 
ficantly better   identified  by  Group  II.      (5)   Group  I  tended   to  confuse 
speakers  whose  SFFs  were   similar.     Group  II  confused  speakers  whose  SFFs 
were similar at the two higher fo levels, but at the two lower fo levels
confused  a  speaker  with  another  speaker  whose  SFF  was   close   to  the  fo 
level  produced.     (6)   For  both  listener  groups  when  the   speakers  were 
paired  according  to  SFF,   the  two  pairs  with  the  highest   and  lowest  SFFs 
were   perceived  approximately  twice   as   often  when  the  fo  level  was  most 
similar  to  their  SFFs   as  when  it  was  most  different  from  their  SFFs. 
(7) Formant frequency means were not significantly correlated with
speaker  confusions. 

The  major  conclusions  derived  from  the  present  study  follow. 

(1)   Naive  and  acquainted  listeners  represent  different  subject  types 
in both degree of identification ability and in kind of response.

(2) Those cues pertinent to speaker identification which are present
in connected speech (a) are quickly assimilated by naive listeners and
(b) are present in lesser number or to a lesser degree in isolated vowels.

(3)  Some  cues   pertinent  to  differentiation  among  speakers   are   (a) 
present  only  in  low  vowels,   (b)   present  to  a  greater  degree   in  low 
vowels   or  (c)  more  readily  perceived   in  low  vowels.     It  may  be   that  some 
characteristic(s)   of  the  first  formant   is   (are)    important  to  the    identi- 
fication of  speakers. 

(4)  High   identification  performance   on  sentence   stimuli  is   not  a 
predictor  for  performance   on  isolated  vowel  stimuli,   at   least  for  naive 
listeners. 

(5) Some characteristic(s) of a speaker's voice is (are) distorted
when he produces an fo different from his SFF. Most likely this change
is not fo per se but is related to fo.

(6)  Both  listener  types   attended  to  SFF,  yet  the   acquainted 
listeners   seem  to  have  focused  on  some   aspect  of,   or  concomitant  to,  SFF 
that  remained  relatively  constant   across  fo  changes.     The  naive   listeners, 
on  the   other  hand,   appear  to  have   perceived  an  aspect  of  SFF  that 
reflected  fo  changes. 

It  appears   a  tenable  conclusion,   also,   that  those  parameters 
containing  cues   specific  to  voice   comprise   a  "Gestalt"  and  that  exami- 
nation of  any  parameter  separately  will  not  yield  definitive   information 
regarding  aural   identification  tasks   using  voice   stimuli. 


CHAPTER  I 
INTRODUCTION 

Identification  of  speakers  by  their  voices  alone  is  a  phenomenon 
which  can  be  observed  daily,  the  most  familiar  example  involving  speaker 
identification  on  the  telephone.  An  equally  familiar  phenomenon  is  the 
misidentification of talkers on the telephone or in other non-visual
situations. Research in the area of aural speaker identification has
been  conducted  in  an  effort  to  determine  the  underlying  parameters  (a) 
that  enable  people  to  be  identified  by  voice  and/or  (b)  that  cause 
confusions  among  speakers.  This  research  is  important  not  only  for 
determination  of  the  relevant  parameters  of  speaker  identification,  but 
also  for  a  description  of  the  conditions  that  are  applicable  to  the 
forensic  situation. 

In  the  situation  noted  above  involving  use  of  the  telephone,  if  a 
call  which  was  expected  is  eliminated,  judgement  of  a  speaker's  identity 
must  be  based  solely  on  perceptual  evaluation  of  the  acoustic  signal. 
When, however, the acoustic signal is transformed by a spectrograph into
visual form, with certain of the acoustic parameters represented
(specifically frequency, intensity and time), another identification
procedure is utilized: so-called "voiceprint" analysis. Utilization of
this "voiceprint" analysis by law enforcement agencies is reported in
popularized form in Sanders' The Anderson Tapes (1970). In the extensive
investigations  reported  in  that  book,  various  law  enforcement  agencies 
"proved" the identity of particular persons by spectrographic analysis of
tape recordings. In the telephone situation above, misinterpretations of
the perceptual cues would result in confusion or, at worst, embarrassment.
However, in the latter situation an erroneous interpretation of
the visual cues could have legal consequences. And, while this type of
analysis has occasionally been accepted for legal purposes, there exists
substantial  doubt  as  to  the  validity  of  the  procedure. 

Kersta (1962), reporting extremely high percent-correct levels of
identification of speakers by trained observers using spectrographic
analysis, stated that "It is my opinion . . . that identifiable
uniqueness  does  exist  in  each  voice,  and  that  masking,  disguising,  or 
distorting  the  voice  will  not  defeat  identification  if  the  speech  is 
intelligible."  He  did  caution,  however,  that  more  research  is  needed 
utilizing  a  large  population  sample.  Bolt  et  al.  (1970)  pointed  out  that 
results  of  studies  on  the  identification  of  speakers  from  spectrograms 
are  dependent  upon  the  observers'  training  and  upon  the  task.  Tosi  et  al. 
(1971)  did  attempt  to  resolve  the  problems  of  (1)  observer  training  and 
(2)  the  nature  of  the  task,  but  the  experimental  conditions  were  very 
limited.  Two  specific  interpretive  cautions  are  necessary:   (1)  these 
studies  were  conducted  within  carefully  controlled  laboratory  situations 
and (2) the spectrograph presents only particular parameters of the
acoustic  signal,  not  all  parameters. 

These  data  do  indicate  the  need  for  further  research  in  the  spectro- 
graphic analysis  area,  but  the  judgement  being  attempted  is,  in  reality, 
judgement  of  a  visually  represented  acoustic  signal.  Therefore,  research 
based on the original acoustic signal (specifically, investigation
utilizing aural speaker identification tasks) is particularly useful
toward determining the discriminating acoustic parameters.

Certainly  speaker   identification  research  may  be   justified   in  consi- 
deration of   its   legal  application  or   in  regard   to  utilization  of  the 
derived  data  by  private   industry  or  governmental  agencies  for  security 
purposes.     Interesting  as  these  suggested  aspects  may  be,   a  more   primal 
motivation  for  research   in  this   area  is  the  necessity  for  obtaining  basic 
information regarding the speech signal and its component parameters.

For  example,   studies  have   been  conducted  to  determine   the   parameters 
that  differentiate  the  various   phonemes   (e.g.,   Peterson  and   Barney,   1952). 
It   is   of  equal   interest  to  determine   those   factors  which  allow  for 
discrimination  between  speakers.     It   is  not  as  yet  known  whether  those 
parameters  which  differentiate   phonemes  are   the   same   parameters  which  are 
utilized  to  distinguish  between  speakers.     There   is,    in  fact,   some 
evidence   to  suggest  the   converse.     Thus,   a  model  for  speaker   identification 
may  be  dissimilar  to  a  model  for  speech  perception. 

Indeed,    in  the  area  of  speaker   identification,   there  exists   a  factor 
that    is  not  pertinent  to  other  speech  research,   but  one  which  may  be  of 
considerable relevance to the speaker identification area: the subject,
or  listener,  may  or  may  not  know  the  speaker.     It   is  not  yet  known 
whether  this  parameter  is   sufficient  to  describe  two  distinct   types   of 
listeners  who  may  respond  differently  under  the  same   conditions.      It   is 
possible that a speaker identification model may in fact be two models:
one  model  which   is   appropriate  for   listeners  who  know  the   speakers  and   a 
second  model  pertinent  to  listeners  who  have  not  yet   "learned"  a  speaker's 
voice. 

In  any  event,  more    information  is  necessary  before  such  a  modeling 
attempt is made. The present study will therefore be confined to
examination of selected factors with potential importance to aural
speaker  identification. 

Review  of  the  Literature 

Pollack,   Pickett   and  Sumby  (1954)  examined  several  parameters 
concerning  aural  speaker   identification.     Four  talkers  of  similar  age, 
with  similar  manner  and  rate  of  speaking,   read  the   same   text.     Comparison 
of  their  whispered  and  voiced  passages  showed  that   in  order  to  obtain 
similar  levels   of  correct   identification,  the  whispered  samples  required 
an utterance duration of about three times that of the voiced samples.
Such  a  finding  would  suggest  that   some  factor  (or  factors)   normally  used 
for   identification   is   (are)  missing   in  the   acoustic  signal  of  whispered 
speech; that is, the fundamental frequency (or some other property of
the  glottal  wave)  may  assist   in  speaker   identification.     For  voiced 
samples,   correct  speaker  identification   increased  with   increased  sample 
duration  up  to  a   length  of  about   1200  milliseconds  and  beyond  this 
duration  there  was  no   improvement   in  listener  performance.     The   authors 
stated  that  duration  was    important  only  in  that   it  allowed  a  larger 
sampling of a speaker's phoneme repertoire to be evaluated by the
listeners.     Additionally,   the   investigators  filtered  the  stimuli  at 
various  high-  and  low-pass   cut-off  frequencies.     From  analysis   of  the 
filtering  data  they  concluded  that    identification  of  speakers    is  not 
critically  dependent  upon  any  single   portion  of  the  frequency  spectrum. 

Compton  (1963)    investigated  filtering  and  duration  with  respect  to 
speaker  identification.     Samples   of  /i/  produced  by  nine  male   speakers 
were varied in duration and were presented under one of seven conditions:
three with high-pass filtering, three with low-pass filtering and one
without filtering. The interaction he observed between duration and
filtering was as follows: the more severe the filtering (low-pass filtering), the
greater  the  duration  required  for  speaker   identification.     Overall,  high- 
pass  filtering  did  not   greatly  reduce   identification  of  speakers  whereas 
low-pass filtering resulted in a significant reduction of percent-correct
identification. This finding suggests that the higher frequencies are
more critical for speaker identification, at least for the vowel /i/.
This result appears to be at variance with the observations of Pollack,
Pickett and Sumby (1954), who found that filtering in any particular
frequency  range   did   not  greatly  reduce    identification.     However,   Pollack, 
Pickett  and  Sumby  presented  samples   of  connected  speech,   thus    identifi- 
cation of  speakers  depended  not  upon  an  isolated  phoneme   but   instead 
depended  on  perception  of  several  phonemes.      It  would   appear,   therefore, 
that  for  connected  speech  any  given  frequency  range   could  contain 
sufficient  speaker   identification  cues  to  compensate   for  the   loss   of  other 
frequencies.     This   probable   compensation  could  not  occur   in  Compton's 
study since only the isolated vowel /i/ was used. Further, Stevens et al.
(1968) suggest that for the vowel /i/ the glottal source characteristics
could be easily perceived from its high second formant frequency and that
this   perception  of  source   characteristics   could  assist   in  making   identi- 
fication judgements. 

Compton  (1963)   further  reports  that  while  durations  of  one-fortieth 
of  a  second  were   sufficient  for   identifying  speakers   above   a  chance   level, 
identification  performance   showed   improvement  with   increased  utterance 
duration,  with  performance    improvement   leveling  off  at   the   100-250  milli- 
second duration  range.     Although  the  values  reported   by  Compton  are   lower 
than those reported by Pollack, Pickett and Sumby (1954), and the two
studies utilized different stimuli (isolated vowel versus connected
speech), the trend was consistent: identification performance first
improved  as  sample  duration  was  increased,  and  performance  then  leveled 
off. 

In the same study, Compton (1963) found that the order of confusions
for speakers was similar across the conditions of filtering and duration,
and his results were examined in terms of the fundamental frequencies of
the vowel productions. Rank-order correlations were performed on the
number of actual and expected speaker confusions based on these fundamental
frequencies, and Compton reported that all correlations were significant
at the five percent level of confidence. His data, then, indicate
that the more similar the fo of any two speakers, the more they were
confused  with  each  other  by  listeners. 
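
The kind of rank-order analysis described above may be illustrated with a
minimal sketch. It is not Compton's actual procedure or data: the speaker
labels, fo values and confusion counts below are hypothetical, and the helper
names (average_ranks, spearman_rho) are introduced here only for illustration.
The sketch ranks each speaker pair by fo distance and by number of confusions,
then computes a Spearman rank-order coefficient; a strongly negative value
would correspond to "the closer the fo, the more confusions."

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical fo values (Hz) for four speakers; not taken from any study.
f0_hz = {"S1": 105.0, "S2": 110.0, "S3": 126.0, "S4": 140.0}

# Hypothetical confusion counts per unordered speaker pair.
confusions = {
    ("S1", "S2"): 14, ("S1", "S3"): 6, ("S1", "S4"): 2,
    ("S2", "S3"): 8, ("S2", "S4"): 3, ("S3", "S4"): 7,
}

pairs = list(combinations(sorted(f0_hz), 2))
f0_distance = [abs(f0_hz[a] - f0_hz[b]) for a, b in pairs]
observed = [confusions[p] for p in pairs]

rho = spearman_rho(f0_distance, observed)
print(f"Spearman rho between fo distance and confusions: {rho:.3f}")
```

A significance test would still be needed before drawing conclusions; with
only a handful of speaker pairs, even a large coefficient may not reach the
five percent level.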

Bricker  and  Pruzansky  (1966)  studied  the  effects  of  context  and 
duration  on  speaker  identification.  After  ten  male  speakers  produced  a 
number  of  sentences,  monosyllables  and  disyllables,  the  experimenters 
gated  out  both  CV  and  V  samples  of  the  same  duration  from  the  sets  of 
four  disyllables.  While  listeners  attained  high  identification  scores 
from  the  sentence  samples,  accuracy  decreased  with  shorter  durations. 
Because  the  CV  excerpts  resulted  in  higher  identification  performance  than 
the  vowel  excerpts  of  the  same  duration,  the  authors  concluded  that  the 
number  of  phonemes  within  the  utterance  is  of  more  importance  than  its 
duration.  It  is  not  possible  to  compare  the  decrement  in  performance 
which  Pollack,  Pickett  and  Sumby  (1954)  reported  when  duration  was  less 
than  1200  milliseconds  with  the  results  of  Bricker  and  Pruzansky  because, 
in  the  latter  study,  the  duration  was  substantially  under  or  over  1200 
milliseconds. The performance for the short samples of monosyllables
(500 milliseconds) was comparable to the performance found by Pollack,
Pickett and Sumby at this duration, however. Further, the performance
levels obtained from these two investigations were higher than those
reported by Compton (1963). These differences would seem to indicate
that the larger number of phonemes per stimulus is of more importance than
stimulus duration per se. These results would further suggest that cues
found in connected speech (e.g., transitions, inflection, rate) assist in
identification judgements.

Bricker and Pruzansky (1966) also found that speaker-ranking for the
vowel excerpt /i/ differed from ranking for the vowel excerpt /a/, i.e.,
a speaker who was identified the best for /i/ would not necessarily be
identified at the highest level for /a/. The data further showed that
confusion in identification between pairs of talkers was not necessarily
reciprocal, i.e., speaker 1 could be identified as speaker 2 without
speaker 2 being in turn identified as speaker 1. From analysis of the
data  provided  by  the  authors,    it  becomes  evident  that  there   is   listener 
confusion  of  three  or  more  speakers  with  each  other  rather  than  confusion 
being  limited  to  a  single   talker  pair.     To   illustrate  this  observation, 
the  Bricker  and  Pruzansky  data  have   been  plotted   (in  part)   as  Figure   1. 
The values provided are the number of misidentifications (i.e., confusions),
and the arrows indicate the direction of these erroneous identifications.
In these examples and for both vowels it is apparent that
speaker GH was misidentified most often as speaker EP, while speaker EP
was identified wrongly as speaker NG more often than as speaker GH. In
this same figure it may be noted that the number of misidentifications
differs  from  /i/  to  /a/. 
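
The distinction between reciprocal and non-reciprocal confusions can be made
concrete with a small sketch. The counts below are hypothetical and are not
the values plotted in Figure 1; the speaker labels A, B and C and the function
names are likewise assumptions made for illustration. Confusions are stored
as directed counts (actual speaker, speaker named by the listener), so that
the two directions of a pair can be compared.

```python
def most_frequent_error(confusions, speaker):
    """Return the speaker most often named in error for `speaker`, or None."""
    errors = {
        chosen: count
        for (actual, chosen), count in confusions.items()
        if actual == speaker and count > 0
    }
    if not errors:
        return None
    return max(errors, key=errors.get)


def reciprocity(confusions, a, b):
    """Ratio of the smaller to the larger directed count for a speaker pair.

    1.0 means the confusion is fully reciprocal, 0.0 means it runs in one
    direction only, and None means the pair was never confused.
    """
    a_as_b = confusions.get((a, b), 0)
    b_as_a = confusions.get((b, a), 0)
    if a_as_b + b_as_a == 0:
        return None
    return min(a_as_b, b_as_a) / max(a_as_b, b_as_a)


# Hypothetical directed confusion counts: (actual, named) -> number of errors.
confusions = {
    ("A", "B"): 9,
    ("B", "A"): 2,
    ("B", "C"): 6,
    ("C", "B"): 5,
    ("A", "C"): 1,
}

for speaker in ("A", "B", "C"):
    print(speaker, "is most often mistaken for",
          most_frequent_error(confusions, speaker))

for a, b in (("A", "B"), ("B", "C"), ("A", "C")):
    print(a, b, "reciprocity:", reciprocity(confusions, a, b))
```

In these made-up counts, A is usually mistaken for B while B is more often
mistaken for C than for A, which is the same non-reciprocal pattern the text
describes for speakers GH, EP and NG.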

Figure 1. Speaker confusions as derived from the
data of Bricker and Pruzansky, 1966 (panels for the vowels /i/ and /a/)

Bricker and Pruzansky (1966) also utilized naive listeners, i.e.,
observers not acquainted with the speakers, in order to eliminate any
possible response bias on the part of listeners due to differing
familiarity with the speakers. Naive listeners were presented with
sentences    in  an  AXB  paradigm  where   sample  A  was  followed  by  sample  X 
(a  reverse  dubbing  of  either  A  or  B)   and  then  by  sample  B.     The   listener 
was   to  judge  whether  X  was   actually  A  or  B.     The  results  of  this 
procedure    indicate  that  naive   listeners   tended  to  make  more  reciprocal 
speaker  confusions   than  did   listeners  who  knew  the  speakers.     Although 
the  results  regarding  reciprocal  and  non-reciprocal  confusions  were  not 
presented   in  detail,  an   inference  may  be  drawn  from  the  differential 
performance of the two listener groups: it would appear that the
acquainted  and  not-acquainted  groups  may  have   utilized  different 
parameters    in  making  identification  judgements.     While  Bricker  and 
Pruzansky did not make direct comparison or perform statistical testing,
it is noteworthy that the overall performance level, i.e., percent-correct
identification  scores,   of  the  naive   listeners   in  the  AXB  task  was  consi- 
derably below  the   overall  performance   level  of  the   acquainted   listeners 
for  their  experimental  tasks. 

It   is   apparent   in  both  the  Bricker  and  Pruzansky  (1966)   and  the 
Compton (1963) studies that listeners tended to confuse particular
speakers  consistently.     Considering  that  the  one   study  utilized  a  single 
vowel  (Compton)   and  the   other  utilized  both  monosyllables   and  sentences 
(Bricker  and  Pruzansky),   this  consistency   in  confusions  argues   for  the 
existence  of  persistent   (although  as  yet  unidentified)   perceptual  cues 
intrinsic  to  a  speaker. 

Stevens  et  al.   (1968)   used  words   and  phrases   as  stimulus  materials, 
with speakers selected as being homogeneous for vocal tract length,
speaking fundamental frequency (SFF) and rate of speech. The listeners,
who  did  not  know  the  speakers,  were  required  to  match  a  sample  word  or 
phrase   to  one   of  eight  presentations  of  the   same  word  or  phrase  by 
different  speakers.     Although  a  definite   learning  effect  occurred,    it  was 
found  that  the  front  vowel  /i/ — in  a  phonetic  context,   not   in  isolation — 
resulted   in  a  slightly  more  correct  speaker   identification  score 
(approximately  two  percent)   than  the   back  vowel  /a/ — also  in  a  phonetic 
context.     The  results  were  not  statistically  tested  for  significance, 
however.     The  authors   stated  that  this  difference   in  correct   identifi- 
cation may  be   accounted  for   in  consideration  of  the  high  second  formant 
of  /i/.     They  further  suggest  that  this  elevated  second  formant  frequency 
would  allow  differences    in  the  glottal  waveform  to  be  easily  perceived 
and  they  propose  that  cavity  configuration  peculiar  to  a  speaker  would  be 
more   likely  to   influence  the  spectrum  at  these  ranges. 

LaRiviere   (1971)   examined  voiced,   filtered  and  whispered  vowel 
samples   in  an  effort  to  compare  the  relative    importance  of  source  and 
vocal  tract  transfer  characteristics.     After  eight  talkers   produced 
voiced  and  whispered  samples,   a  second  set  of  voiced  samples  was  dubbed 
after  the  samples  had  been  low-pass   filtered  at   200  Hz.     Listener  perfor- 
mance was  nearly  equal  for  the  whispered  samples   (representing  vocal 
tract  characteristics)   and   low-pass  filtered  samples   (representing  source 
characteristics).     Further,   the  summed   levels   of  these  stimuli  nearly 
equaled  the  performance   levels  found  for  the  normal  productions   (the 
voiced,  non-filtered  vowels).     These   results   indicate  that  both  source 
and  formant  frequency  characteristics   are  of   importance  for  speaker  iden- 
tification.    Fundamental  frequency,   Formant  Two  and  Formant  Three  were 
similar  as   predictors  for  speaker  confusions   although  none   of  the 
correlations reached statistical significance. The low vowels /a/ and
/æ/ yielded slightly better scores than the high vowels /i/ and /u/,
with /u/ yielding the lowest score of all. Further, phoneme intelligibility
was "found to be neither a necessary nor sufficient concomitant to
speaker  identification,"  a  finding  that  suggests   that  the   perceptual  cues 
important  for  phoneme   identification  are  not  necessarily   important  for 
speaker  identification  and  vice  versa. 

LaRiviere   (1971)   also  compared  utterances   of   (a)    isolated  phonemes, 
(b)   a  synthetic  syllable   composed  of  the  two   isolated  phonemes   and  (c)   a 
normally  produced  syllable,   all  of  equal  duration.     While   the   results 
were  difficult  to  interpret,  he   concluded,   as  did  previous  researchers, 
that duration is important in that ". . . it allows listeners to sample
larger segments of a speaker's phoneme repertoire." Further, he suggests
that ". . . this added information is based not on steady state phonemic
cues . . ." but on other cues which are integral and intrinsic to
connected  phonemes   (e.g.,   transitional  cues),   since  the  normally  produced 
syllable  resulted   in  the  highest   identification  performance. 

Thus,   previous   investigators  have  examined  various   parameters   of   the 
acoustic  production  of  speakers   in  attempts   to  determine  the   critical 
element  or  elements   involved   in  correct   listener   identification  of  a 
speaker  from  a  sample   utterance.     The  work  in  speaker   identification  may 
be  summarized  with  respect  to  the  following  categories. 

Fundamental  Frequency.     It   is   possible  to  determine  a  mean  funda- 
mental frequency  for  a  speaker  based  on  his   production  of   (1)   a  passage 
of  normal  speech — his   speaking  fundamental  frequency  (SFF)   or  (2)   an 
isolated  phoneme — referred  to  simply  as   fundamental  frequency  (fo). 
Stevens  et  al.   (1968)   selected  speakers  with  similar  SFFs  but  they  did 
not utilize this parameter as a part of their analysis. In other words,
while  it  is  possible  that  listeners  may  have  been  sensitive  to  small  SFF 
differences  between  speakers  and  may  have  used  the  remembered  pitch  of  a 
speaker  as  an  identification  criterion,  the  SFF  parameter  was  not 
examined  directly.  Compton  (1963)  indicated  that  the  fo  of  a  speaker's 
isolated  vowel  is  closely  correlated  with  speaker  confusions.  LaRiviere 
(1971)  concluded  that  the  fo  of  isolated  vowels  was  one  of  the  acoustic 
parameters utilized by listeners in that he found a trend for speaker fo
similarity  to  be  correlated  with  speaker  confusions.  Thus,  both  Compton 
and  LaRiviere  concluded  that  fundamental  frequency  is  related  to  speaker 
confusions.  However,  the  relationship  between  normal  SFF  and  speaker 
identification  or  SFF  and  speaker  confusions  is  not  known.  An  obvious 
approach  to  these  questions  is  that  of  requiring  speakers  to  produce 
identical fo levels so that this parameter may be studied further.

Formant Frequencies. Stevens et al. (1968) suggest that the high
second  formant  of  /i/  may  be  important  for  identification  of  speakers. 
LaRiviere  (1971)  reported  that  the  second  and  third  formant  frequencies 
were  related  to  speaker  confusions,  although  the  correlations  did  not 
reach  statistical  significance.  Since  the  formant  frequencies  of  a 
speaker  vary  over  several  productions  of  the  same  vowel,  though,  it  is 
possible  that  the  relationship  between  the  formant  frequencies  and 
speaker  confusions  might  be  enhanced  by  considering  several  productions 
of  a  vowel  rather  than  a  single  occurrence.  Further  research  is 
indicated  in  order  to  examine  more  closely  the  relationship  between 
formant  frequencies  and  speaker  confusions. 

Vowels. Although Stevens et al. (1968) used contextual vowels which
contained formant frequency transitions, they did not examine their data
for any possible effect of the transitions on their results. They did
conclude,  however,  that  front  vowels  tend  to  yield  higher  speaker  iden- 
tification scores  than  back  vowels.  LaRiviere  (1971)  found  that  listener 
performance  was  better  for  low  vowels  than  for  high  vowels.  Bricker  and 
Pruzansky  (1966)  concluded  that  speakers  were  ranked  differently  for  the 
two  vowel  excerpts  /i/  and  /a/  and  that  speaker  confusions  were  not 
always  reciprocal.  Thus,  the  results  of  the  investigations  utilizing 
several  vowels  are  in  conflict.  However,  the  results  of  Stevens  et  al. 
and  LaRiviere  may  not  be  comparable,  since  they  did  not  use  similar 
classes of utterances. Indeed it is likely that isolated vowels are not
comparable  to  vowels  in  context  for  purposes  of  identification  of 
speakers  since  vowels  in  context  contain  additional  cues.  In  order  to 
further examine this issue, vowels representing each of the extreme
positions of production (high front, high back, low back and low front)
should be examined in the same experimental situation.

Duration.  Pollack,  Pickett  and  Sumby  (1954)  observed  that 
increasing  duration  above  1200  milliseconds  does  not  lead  to  improved 
performance,  indicating  that  this  factor  can  be  controlled  by  utilizing 
sample  durations  greater  than  1200  milliseconds.  However,  the  results  of 
Compton  (1963)  suggest  that  for  the  isolated  vowel  /i/  performance  does 
not  greatly  improve  above  the  100-250  millisecond  range.  Further, 
Bricker  and  Pruzansky  (1966)  concluded,  as  did  LaRiviere  (1971)  and 
Pollack,  Pickett  and  Sumby,  that  duration  is  of  less  importance  than  the 
number  of  contextual  phonemes  available  to  the  listeners.  Although  there 
is  disagreement  regarding  the  specific  duration  beyond  which  performance 
does not show further improvement, this particular parameter can be eliminated
as a variable by using a duration greater than the larger figure
(1200 milliseconds) noted above.

Listeners. The greatest proportion of the research relevant to
speaker  identification  has  been  conducted  with  listeners  who  were  in 
daily contact with the talkers. However, Stevens et al. (1968) utilized
listeners  who  were  not  acquainted  with  the  talkers  but,  instead,  were 
trained  to  identify  the  speakers  by  the  use  of  a  series  of  learning 
tasks.  And  Bricker  and  Pruzansky  (1966)  reported  that  naive  listeners — 
listeners  who  neither  knew  the  speakers  nor  received  training — had  lower 
performance  scores  than  listeners  who  knew  the  speakers.  Further,  the 
naive  listeners  of  Bricker  and  Pruzansky  tended  to  make  reciprocal 
speaker  confusions  to  a  greater  extent  than  did  listeners  who  knew  the 
speakers, e.g., identifying speaker 1 as speaker 2 and 2 as 1. However,
since their naive group performed a different task in a separate experiment,
it is possible that the results are not directly comparable. The
listeners  who  knew  the  speakers  identified  them  by  name,  while  the  naive 
group  performed  a  matching  task — matching  a  reversed  sentence  to  one  of 
two  forward  sentences.  Thus,  an  investigation  is  needed  in  which  both 
types  of  listeners  are  presented  with  the  same  utterances  and  perform 
identical  tasks. 

Purpose 

In  summation,  then,   (1)  it  is  not  known  whether  listeners  who  do 
and  listeners  who  do  not  know  the  speakers  perform  similarly  in  a  speaker 
identification  task;  (2)  there  is  a  lack  of  consistency  in  the  literature 
as  to  whether  speaker  identification  is  better  for  front  vowels  than  for 
back  vowels,  or  for  high  vowels  than  for  low  vowels;  (3)  while  there  is 
an  indication  that  fundamental  frequencies  of  vowel  productions  are 
related to speaker confusions, the relationship of fo to speaker identification
has not been examined; (4) similarly, the relationship between
a speaker's normal speaking fundamental frequency and speaker identification
has not as yet been examined, nor has the relationship between
SFF and speaker confusions been investigated; and (5) although there
appears   to  be   a  slight  trend  to  a  relationship  between  formant 
frequencies   and  speaker  confusions,   this  relationship  needs  further 
investigation. 

Therefore,   the   present   study  was  designed  to  examine   the  above 
factors   by  utilizing   isolated  vowels   of   the  same  duration,   at  the  same 
fundamental  frequencies,   as  experimental  stimuli.     The   specific  research 
questions   are   listed  below  and  in  all  cases   a  null  hypothesis  was 
utilized. 

1.  Do  listeners  who  know  the   speakers   and  listeners  who  do  not  know 
the   speakers   (a)   perform  equally  well  on  a  speaker   identification  task 
and  (b)   exhibit  similar  types   of  responses   (as  measured  by  the   following 
variables)? 

2. Do certain isolated vowels provide more accurate speaker identification
than other vowels?

3.  What   is  the  relationship  between  the  fundamental  frequencies  of 
vowels   and  speaker   identification,  when  all  speakers   produce  the 
utterances  at  the  same  specified  fundamental  frequencies? 

4.  Are  speaker  identifications  and/or  speaker  confusions,  both 
based  on  isolated  vowel  stimuli,  related  to  the  speaking  fundamental 
frequency  of  a  reading  passage? 

5.  What   is  the  relationship  between  formant  frequencies   and  speaker 
confusions when the formant values are the means of several vowel productions?


CHAPTER   II 


PROCEDURE 


The   present  study  was  designed  to  examine  parameters  within  the   area 
of  aural  speaker   identification.     The  first  question,  which  concerns  the 
listener  groups,  may  provide    insight   into  whether  the   two  types   of 
listeners   (one  group  of  listeners  who  know  the  speakers  and  a  second 
group of listeners who do not) perform equally well. Further, information
may be obtained on whether or not the listener groups utilize the
same   acoustic  parameters    in  making  their  decisions. 

Concerning the second question, vowel differences, Stevens et al.
(1968)   and  LaRiviere   (1971)   concluded  that   speaker   identification  differs 
according  to  the   vowel.     However,    it   is  not  yet  certain  which  particular 
vowels   or  classes   of  vowels  yield  the  highest  speaker  identification 
scores.     Four  isolated  vowels  were  therefore  chosen  in  the   present  study 
in  order  to  examine   the   influence  of  vowels. 

The  third   issue,   concerning  fundamental  frequency  of  vowels   and   its 
relationship to speaker identification, was noted by Compton (1963) and
LaRiviere (1971) as being of potential importance to speaker confusions.
The basic relationship of fundamental frequency to speaker identification,
however, is not known. In order to obtain some information
concerning the relationship of fo to speaker identification, the present
study  utilized  stimuli  which  were  produced  at  specified  fundamental 
frequencies. 

With respect to the fourth question, the relationship of speaking
fundamental frequency (SFF) to speaker identification is not known. In
order to examine this parameter, it was necessary to determine SFFs of
the speakers by analysis of a connected speech sample. The SFFs were
then correlated with the experimental results of (1) speaker identification
and (2) speaker confusions in an effort to determine whether SFF
was  related  to  these  factors. 

In  regard  to  the  fifth  question,   concerning  the  relationship  between 
formant  frequencies   and  speaker  confusions,  LaRiviere   (1971)   concluded 
that  there    is  a  trend  for  similarities  between  speakers'   formant 
frequencies  to  be   positively  correlated  with  speaker  confusions,    i.e., 
the   closer  the   formant  frequencies  of  any  two  speakers,   the  more   likely 
they  are  to  be  confused  with  each  other.     Examination  of  the  relationship 
between any two speakers' formant frequencies, and analysis of the degree
to  which  the  talkers  were  confused,   should  provide    information  about  the 
relative   importance  of  this   factor. 

Experimental  Procedure 

Subject  Selection 

Speakers.     Six  males  were   selected  from  a  group  of  prospective 
speakers  who  (1)   spoke  General  American  English,    (2)  had  no  history  of 
speech  problems   and   (3)   were  between  23  and  34  years  of  age.     Only  those 
speakers  whose  speaking  fundamental  frequency  was  within  the  range  of 
selected  vowel  fundamental  frequencies   (139-98  Hz)  were  utilized — the 
SFFs   of  the  speakers  may  be  found   in  Table   I.     It  should  be  noted  that 
while   the  SFF  values  differ  for  the   passage   and  the  sentence  stimuli, the 
rank-ordering  of  the  speakers    is  constant. 
It was expected that speakers with SFFs within the range of the
experimental fos would have less difficulty in producing the required
fundamental frequencies than speakers whose SFFs were outside this fo
range. Further, this selection criterion was applied with the expectation
that less modification would occur in the normal or usual acoustic
parameters for the selected speakers than for other speakers. As indicated,
all prospective subjects were requested to match the specified fo
levels   for  the  vowels.     Williams   (1964)   noted  that  naive   listeners   can 
learn  to  differentiate   a  maximum  of  six  speakers   and  still  perform  above 
a  chance   level.     Therefore,   the  number  of  speakers  selected  was   limited 
to  six. 

Listeners. Sixteen listeners were selected from a group of volunteers
with normal hearing. Listeners were chosen to fill each of two
categorical  classifications:      Group  I   (n  =  8)   consisted  of  subjects  who 
were   acquainted  with  the  speakers   and  were   presumed  to  be  familiar  with 
their  speaking  characteristics   and  Group  II   (n  =  8)   consisted  of  subjects 
who  did  not  know  the  speakers.     Although  phonemic   identification  of  any 
stimuli  was  not  required   in  the  experimental  task,   all   listeners   selected 
were  required  to  have  had  a  background  of  phonetic  or   phonemic  training. 
It  was   judged  that  the    isolated  vowel  samples  would  be   less  distracting 
to  subjects  who  had  received  such  training  since   listening  to  such 
stimuli  would  not   be  novel  to  them. 

Selection of Vowels

Four vowels, /i, æ, a/ and /u/, were selected as representing the
extreme differences in place of production and of formant frequencies.
Isolated vowels were chosen because (1) research has demonstrated that
identification of speakers is possible from such stimuli and (2) by the
use of isolated vowels it is possible to control or eliminate several
variables   that  were  not  under  study  (for  example,   formant  frequency 
transitions).     Further,   the  four  vowels   chosen  would  allow  for  testing  of 
the observation of Stevens et al. (1968), Bricker and Pruzansky (1966)
and LaRiviere (1971) that there is a speaker identification difference
among vowels. These four experimental vowels can be dichotomized in two
ways: (a) front-back, /i/ and /æ/ versus /a/ and /u/, or (b) high-low,
/i/ and /u/ versus /æ/ and /a/. These classifications allow for examination
of the effect of gross differences in formant frequencies on speaker
identification since, as Peterson and Barney (1952) observed, front vowels have a
higher  second  formant  frequency  than  back  vowels  and  high  vowels  have  a 
lower  first  formant  frequency  than  low  vowels. 

Selection of Fundamental Frequencies

Compton  (1963)   and  LaRiviere   (1971)   observed  that  fundamental 
frequencies   of  speakers   tended  to  be   correlated  with  speaker  confusions. 
Therefore it was decided to (1) have all speakers produce the same fundamental
frequencies for the vowel productions and (2) examine the relationship
of fo to speaker identification. Four experimental fundamental
frequencies were chosen which were one tone apart: 139, 123, 110 and 98
Hz. All speakers were required to match the criterion frequencies of 139
and 123 Hz within ± 3 Hz and of 110 and 98 Hz within ± 2 Hz. Thus, since
all speakers produced essentially the same fundamental frequencies, it is
possible to examine the importance of fo for identification of speakers.
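
The spacing and tolerance limits described above can be illustrated with a short sketch. It assumes equal-tempered tuning, in which one whole tone corresponds to a frequency ratio of 2^(2/12); the study does not state how the four values were derived, and the function names here are hypothetical.

```python
# Sketch only: assumes equal temperament, which the source does not state.
WHOLE_TONE = 2 ** (2 / 12)  # frequency ratio of one whole tone


def whole_tone_series(lowest_hz, count):
    """Return `count` frequencies rising by whole tones from `lowest_hz`."""
    return [lowest_hz * WHOLE_TONE ** step for step in range(count)]


def within_tolerance(measured_hz, target_hz):
    """Apply the stated limits: +/-3 Hz at 139 and 123 Hz, +/-2 Hz at 110 and 98 Hz."""
    limit = 3.0 if target_hz >= 123 else 2.0
    return abs(measured_hz - target_hz) <= limit


# Rounded to whole Hz, the series starting at 98 Hz can be compared
# with the four experimental levels.
series = [round(f) for f in whole_tone_series(98.0, 4)]
```

Under this assumption, a production would be accepted only if `within_tolerance` holds for its measured fo against the target level.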

Recording of Stimuli

Figure  2  provides   a  diagram  of  the  equipment  used   in  the  recording 
procedure. Recording of the stimulus material was accomplished by utilization
of a sound-treated room (IAC 1204-A) with the talker seated
approximately six inches in front of the microphone (Electrovoice 664).
All samples were recorded at 7.5 ips on a single-track tape recorder
(Ampex 351) which was located outside the room.

The  speakers  were  asked  to  state  their  first  name  within  the  context 
of the carrier phrase, "My name is . . . ." and to read from a revised
version (an excerpt of about one minute duration) of "An Apology for
Idlers" (Appendix A) by Robert Louis Stevenson (1906).

Speakers recorded each of the vowels, /i, æ, a/ and /u/, at fundamental
frequencies of 139, 123, 110 and 98 Hz (total = 16 samples per
speaker). Since they were required to produce a specified fo, a
Hewlett-Packard low-frequency oscillator (202CR), whose frequency was verified by
a Hewlett-Packard electronic counter (522B), was used to assist them in
the   task  as  described  below.     Each  experimental  fo   level  was   presented  to 
the  talker  from  the  oscillator  via  an  amplifier  (Dynakit)   and  a  single 
TDH-39  earphone,  enabling  the   talker  to  match  the  frequency  by  utilizing 
beats.     The  experimenter  also  used  a  single  TDH-39  earphone  to  monitor 
the   production  and  when  the   subject  and  the  experimenter  both  judged  the 
frequency  to  be  matched,   the  vowel  was  recorded.     Subjects  were  requested 
to maintain a constant level of -2 dB on the VU meter of the tape recorder
while producing the stimuli; this level was observed and verified by
both  the  subject  and  the  experimenter. 

Acoustic Analysis

Measurement  of  Fundamental  Frequencies.     After  recordings  were  made 
of the reading passage and of the vowels at the four fo levels, fundamental
frequencies were measured from oscillographic tracings (Honeywell
Visicorder, 1508A). The samples from prospective speakers were played
from a tape recorder (Ampex 354) at 7.5 ips to the Visicorder. The paper
speed was 10 ips and timing markers of .1 second were used. Measurement
of fundamental frequency for the vowels was accomplished by dividing the
total number of waveforms by the time involved. For example, if there
were  170  waveforms  in  a  1.7  second  sample,  the  fundamental  frequency 
would  be  100  Hz.  The  measurement  of  the  speaking  fundamental  frequency 
of  the  reading  passage  was  slightly  more  complicated.  In  this  instance 
silent pauses and noise bursts were obviously not included as part of the
measurement of SFF. However, the total number of waveform peaks was still
divided by the time involved for the measured waveforms. This was accomplished
by counting the number of waveforms in each burst of speech and
then  totaling  the  number  of  such  waveforms.  The  total  duration  was 
measured  in  the  same  manner.  Thus,  a  mean  SFF  of  the  speech  passage  was 
derived.  As  noted  above,  the  SFF  of  the  reading  passage  was  utilized 
(1)  as  a  speaker  selection  criterion  and  (2)  for  obtaining  correlational 
data  with  both  speaker  identification  scores  and  speaker  confusions. 
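
The counting procedure described above can be expressed as a brief sketch. The function names and the burst values are hypothetical illustrations and are not the study's measurements; only the 170-waveform example comes from the text.

```python
def vowel_f0(waveform_count, duration_s):
    """Fundamental frequency of a sustained vowel: waveforms divided by time."""
    return waveform_count / duration_s


def speaking_f0(bursts):
    """Mean SFF of a passage.

    `bursts` is a list of (waveform_count, duration_s) pairs, one per
    stretch of measured speech. Silent pauses and noise bursts are simply
    left out of the list, so they contribute neither waveforms nor time.
    """
    total_waveforms = sum(count for count, _ in bursts)
    total_duration = sum(duration for _, duration in bursts)
    return total_waveforms / total_duration


# The example from the text: 170 waveforms in a 1.7 second sample.
example_f0 = vowel_f0(170, 1.7)

# Hypothetical bursts from a reading passage (illustrative values only).
passage_sff = speaking_f0([(120, 1.0), (90, 0.8), (60, 0.6)])
```

Pooling the counts before dividing weights each burst by its duration, which matches the description of totaling waveforms and totaling time rather than averaging per-burst frequencies.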

The  fo  of  the  isolated  vowels  was  used  to  verify  that  all  speakers 
produced  the  four  specified  fo  levels  within  the  limits  prescribed.  All 
speakers did in fact produce the fo levels within ± 3 Hz for the two
higher fos and ± 2 Hz for the two lower fos.

The  accuracy  of  the  oscillographic  tracings  was  measured  by 
recording  a  100  Hz  pure  tone  (Hewlett-Packard  low-frequency  oscillator, 
202CR)  whose  frequency  was  verified  by  an  electronic  counter  (Hewlett- 
Packard 522B). A tracing was then made (as described above) and
measured.  The  frequency  was  accurate  within  ±  .1  Hz. 

Spectrographic Analysis. Spectrograms were made of the vowel
samples on a Kay Electric Sonagraph (6061) and measurements were made of
the formant frequencies at the center of the energy band for the first
two formants. A 500 Hz square wave calibration tone was recorded on each
spectrogram  and  was  used   in  scaling  for  frequency  measurements.     These 
measurements  were   later  used  to  examine   the  relationship  of  the  formant 
frequencies   and  speaker  confusions    in  the  data  analysis. 

Preparation of the Experimental Tapes

Vowel  Stimuli.     The  vowel  samples   that  met  the  criteria  of  (1) 
prescribed  fundamental  frequency  (within  the  tolerance   limits  described 
above),    (2)   correct  vowel  production  (as  determined  by  the  experimenter 
and  the  speaker)   and  (3)   constant    intensity  level  (subject-controlled  and 
experimenter-monitored)   were   utilized   in  the  experimental  tapes.     The 
relative    intensity  of  all  vowel  samples  was  determined  by  measurement   on 
the Brüel and Kjaer (2305) Level Recorder. Based on these measurements,
level adjustment was made where necessary in the tape-dubbing procedure
such that all samples in the randomized experimental tapes (described
below) were of equal intensity (± 2 dB).

Pollack, Pickett and Sumby (1954) found that durations greater than
1200  milliseconds  did  not   influence  speaker   identifications.     Therefore, 
a  vowel  duration  of  1500  milliseconds  was  chosen  for  the  present  study  as 
being  sufficiently  greater  than  1200  milliseconds   to  eliminate   any 
possibility  of  duration  effects.     Further,  each  stimulus  was  spliced  from 
the   center  of   the  vowel  production  to  eliminate   any  acoustic  cues  which 
might  be   perceived   in  the  beginning  and/or  termination  of  a  speaker's 
production. 

Each  sample  was  dubbed  three   times   and  three  experimental  tapes    (each 
containing  the   same  material)  were   assembled  utilizing  three  different 
randomizations. Three randomizations were used in order to determine
whether a learning effect had occurred. The inter-stimulus interval was
five seconds. A 200 Hz square wave of one second duration was recorded
on the tapes after every tenth sample so that identification of a wrong
sample could be quickly detected.
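
The tape assembly described above may be sketched as follows. The sketch assumes that each tape holds all 96 vowel samples (six speakers, four vowels, four fo levels) exactly once; the seeds, labels and function name are hypothetical, and the original tapes were of course assembled by physical dubbing and splicing.

```python
import random

SPEAKERS = range(1, 7)
VOWELS = ["i", "ae", "a", "u"]
F0_LEVELS_HZ = [139, 123, 110, 98]
MARKER = "200 Hz square wave, 1 s"  # check tone placed after every tenth sample


def build_tape(seed):
    """Return one randomized stimulus order with a marker after every tenth sample."""
    stimuli = [(s, v, f) for s in SPEAKERS for v in VOWELS for f in F0_LEVELS_HZ]
    random.Random(seed).shuffle(stimuli)  # seeded so the order is reproducible
    tape = []
    for position, stimulus in enumerate(stimuli, start=1):
        tape.append(stimulus)
        if position % 10 == 0:
            tape.append(MARKER)
    return tape


# Three tapes with the same material in three different random orders.
tapes = [build_tape(seed) for seed in (1, 2, 3)]
```

With markers at fixed positions, a listener who loses his place on the answer sheet can resynchronize at the next tone, which appears to be the purpose stated in the text.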

Training and Sentence Stimuli

As  stated,  recordings  were  made  as  each  speaker  read  an  excerpt  from 
a  modified  version  of  "An  Apology  for  Idlers."  These  recorded  excerpts 
(excepting  the  last  two  sentences  of  the  passage)  were  used  to  train  the 
listeners.  The  self-identified  recordings  of  the  passages  were  dubbed 
three  times  and  the  dubbed  passages  combined  into  three  differently 
ordered  segments.  For  the  first  presentation  the  order  of  the  speakers 
was  the  same  as  the  order  of  the  names  on  the  listeners'  answer  sheets, 
while  the  next  two  repetitions  were  in  randomized  order. 

The  last  two  sentences  of  the  passage  were  also  dubbed  three  times, 
and  the  resulting  eighteen  passage  segments  (six  speakers;  three 
dubbings)  were  combined  in  randomized  order  for  presentation  to  the 
listeners  in  order  to  obtain  a  performance  level  on  sentence  stimuli. 
Thus,  the  experimental  tapes  consisted  of  (1)  a  training  tape  (the 
reading passage), (2) sentence stimuli and (3) three separate presentations
of the vowel stimuli. The subjects made speaker identification
judgements  for  the  latter  two  sets  of  stimuli. 

Listening Sessions

All listening sessions were conducted in a sound-treated room (IAC
1204-A)  with  listeners  seated  before  a  table.  The  experimental 
recordings  were  presented  to  the  listeners  from  an  Ampex  (351)  tape 
recorder  (located  outside  the  experimental  room)  through  a  loudspeaker 
(AR-3)  placed  in  front  of  them.  The  stimuli  were  administered  to  only 
one  or  two  listeners  at  a  time  in  order  to  hold  approximately  equal 
loudspeaker-to-subject distance and angle, and stimuli were presented at
a comfortable loudness level to all subjects. A visual aid was provided
in the experimental room in the form of a series of 8 x 10 inch black
and  white   photographs   of  the   speakers  which  were  fastened  to  the  wall   in 
front  of  the  subjects   at   a  convenient  viewing  height.     These  photographs 
were    in  serial  order  from  left  to  right  so  as  to  correspond  to  the  order 
of  names   on  the  answer  sheets,   and  the  speaker's   name  was   provided   in 
large  block  letters  beneath  each  photograph.     The   photographs   and  the 
names  provided  beneath  the  photographs  were   used  as   a  visual  device  to 
assist    in  learning  the    individual  voices. 

Initially,   both  subject  groups  were   presented  the   training  material 
consisting  of  the   passages  read  by  the  speakers.     The   listeners  heard 
each  speaker  give  his  name   and   a  rendering  of  the  excerpt  from  "An 
Apology for Idlers," with speakers being presented in a sequence corresponding
to the order of the answer sheet. Each listener then heard two
repetitions   of  the   above  material  but  with  speakers   presented   in 
randomized  order  on  these   latter  repetitions.     For  the   training  session 
the  subjects  merely  listened  to  the   speakers. 

The   training  was   followed  by  the  sentence   stimuli  and  the   subjects 
were  requested  to   identify  the  speaker  of  each  utterance.     The   task  was 
identical  for  all  listeners:      judge  which  of   the  six  speakers  produced 
each  utterance   and  record  that   judgement  on  a  prepared  answer  sheet  by 
circling  the  name   of  the   appropriate  speaker.     The   text  of  the 
instructions   given  to  the   listeners   is   presented   in  Appendix  B  and  a 
sample answer sheet is provided in Appendix C. On completion of the
sentences,   subjects  were   presented  the  three  separate  randomizations  of 
the  vowel  stimuli  and  were  again  requested   to   identify  the   speakers. 
After completion of each randomized set of utterances subjects were
allowed  a  rest  period  at  their  option. 

Data  Analysis 

Analysis   of  Variance 

A  Random  Block  Factorial  design   (Kirk,   1968)   with   listeners  nested 
in  groups  was  employed.     The  factors  examined  were  speakers,   listener 
groups,   vowels   and  vowel  fundamental  frequencies.     Each  listener  was 
considered   a  block  so  that  there  were  eight   levels  within  a  listener 
group  and  two   levels   of   listener  groups.     Further,   there  were   six  levels 
of  speakers,   and  four  levels  each  for  vowels   and  fundamental  frequencies. 
In short, this design is a mixed model, with listeners considered to be
a  random  effect   and  speakers  a  fixed  effect  because  speakers  were  not 
randomly  selected   (speaker  selection  was  based  on  several  factors,   such 
as   age   and  SFF,   which  relate   to  speaker   identification). 

The   factorial  design  utilized  allows   both  main  effects  and   inter- 
actions  to  be  examined.     When  the  F  ratio  for  any  term  in  the   analysis   of 
variance  was  determined  to  be   significant,   comparisons   of  treatment   level 
means  were   carried  out   in  order  to  determine  the  relative    importance  of 
each factor. For these a posteriori comparisons, Tukey's Honestly Significant
Difference (HSD) test was used (Kirk, 1968). The experimental
design is illustrated in Figure 3.

Correlational Analysis

In  order  to  determine  the  relationship  between  selected  acoustic 
measurements   and   identification  of  speakers,  rank-order  correlations  were 
performed (Kendall's tau; Siegel, 1956). The acoustic measurements that
were used were the speaking fundamental frequencies as determined from
the reading passage and the first two formants of the isolated vowel
stimuli. 

Speaking  Fundamental  Frequency.     Speaking  fundamental  frequency  was 
correlated  with   identification  judgements    in  two  ways.      In  the  first 
instance   it  was  hypothesized  that   listeners  would  be  attentive  to 
perceived differences in pitch between the speakers and, according to
this hypothesis, the more a speaker's SFF differed from the rest of the
speakers the better he would be identified. Therefore the speakers' SFFs
were  ranked  according  to  their  deviation  from  the  group  mean.     For  this 
ranking  the   speaker  whose  SFF  was   farthest  from  the   group  mean   (in  either 
direction)   was   given  a  rank  of   1.     The  ranking  of  SFF  deviation  was   then 
correlated  with  the  correct   identification  scores   for  each  speaker.     (All 
scores  were  based  on  the    isolated  vowel  samples  with  all  vowels   and  all 
fundamental  frequencies  pooled.) 
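
The ranking and correlation just described can be sketched as follows. All numerical values are hypothetical and are not the study's data; the sketch also assumes no tied values, for which the simple form of Kendall's tau shown here is adequate.

```python
from itertools import combinations


def rank_descending(values):
    """Rank 1 goes to the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def kendall_tau(x, y):
    """Kendall's tau for paired observations without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical SFFs (Hz) and percent-correct scores for six speakers.
sff_hz = [101.0, 110.0, 116.0, 122.0, 128.0, 140.0]
percent_correct = [62.0, 48.0, 41.0, 44.0, 50.0, 66.0]

# Rank 1 = SFF farthest from the group mean, in either direction.
group_mean = sum(sff_hz) / len(sff_hz)
deviation_rank = rank_descending([abs(f - group_mean) for f in sff_hz])
score_rank = rank_descending(percent_correct)

tau = kendall_tau(deviation_rank, score_rank)
```

A positive tau in such an analysis would be consistent with the hypothesis that speakers whose SFF is more distinctive are identified more accurately.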

In the second SFF examination, SFF was compared to speaker confusions.
In this instance it was hypothesized that the closer any two
speakers were in SFF, the more often they would be confused with each
other.     SFFs  were   correlated  with  speaker  confusions  when  (a)    all  vowels 
and  all  fundamental  frequencies  were  pooled  and   (b)   only  vowels  were 
pooled. 
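
For the confusion analysis, each pair of speakers contributes one SFF difference and one confusion count. The following sketch shows one way such pairs might be tabulated before rank-order correlation; the values are hypothetical, and the summing of confusions in both directions is an assumption, since the text does not specify how pairwise confusions were tallied.

```python
from itertools import combinations

# Hypothetical values for three speakers, not the study's data.
sff_hz = {1: 101.0, 2: 110.0, 3: 116.0}

# confusions[(presented, chosen)] = times `presented` was identified as `chosen`
confusions = {(1, 2): 4, (2, 1): 6, (1, 3): 1, (3, 1): 2, (2, 3): 9, (3, 2): 7}


def pairwise_table(sff, confusion_counts):
    """Return (pair, absolute SFF difference, total confusions) for each speaker pair."""
    rows = []
    for a, b in combinations(sorted(sff), 2):
        difference = abs(sff[a] - sff[b])
        # Assumption: confusions in both directions are summed for the pair.
        confused = confusion_counts.get((a, b), 0) + confusion_counts.get((b, a), 0)
        rows.append(((a, b), difference, confused))
    return rows


rows = pairwise_table(sff_hz, confusions)
```

The hypothesis stated above predicts a negative rank-order relationship between the second and third columns: the smaller the SFF difference, the larger the confusion count.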

Formant Frequencies. LaRiviere (1971) found that while formant
frequencies tended to be correlated with speaker confusions, these
relationships were not statistically significant. It was decided to
further test this relationship by averaging the formant frequencies over
four vowel productions and then performing rank-order correlations
between these means and speaker confusions. It is possible that if
formant frequencies are related to speaker confusions, the average of
several vowel productions would better approximate a listener's
perception. Accordingly, the means of the four productions of a vowel
were obtained for the first and second formant frequencies. Rank-order
correlations  were  performed  separately  between  each  formant  frequency  for 
each  vowel  and  speaker  confusions   for  that  vowel. 

Thus, the data were subjected to two separate analyses: (1) the
analysis  of  variance  and   (2)   the  correlational  analysis.     The   first 
analysis   (and  subsequent  a  posteriori  testing)   enabled  the   investigator 
to  evaluate   the   groups'   performances   in  relation  to  each  other  on  vowels, 
fundamental  frequencies   and  speakers.     This   analysis   also  allowed  for 
comparison  of  differences  within  a  parameter  within  a  group.     The 
correlational  analysis   allowed  for  examination  of  the  relationships 
between  speaking  fundamental  frequency  and   listener  responses   and  between 
formant  frequencies   and  speaker  confusions. 


CHAPTER   III 


RESULTS   AND  DISCUSSION 


The   purpose   of   the   present   study  was   to  examine   specific  parameters 
in  the   area  of  aural  speaker  identification.     The   objectives  of  the 
present study were to determine whether the two types of listeners performed
equally well for the isolated vowel stimuli when these stimuli were
produced at specified fundamental frequencies by all speakers. It was
also of interest to determine whether there was a relationship between
speaking fundamental frequency and listener responses and between formant
frequencies   and   listener  responses.      In  order  to  examine   the  research 
questions  the  experimenter  had  six  male   speakers  record  (a)   a  reading 
passage   consisting  of  an  excerpt  from  "An  Apology  for  Idlers"  and   (b) 
four  vowels   at  each  of  four  fundamental  frequencies.     Two  groups  of 
listeners — (I)    those  who  knew  the   speakers   and   (II)    those  who  did   not — 
were   trained  to   identify  the   speakers   by  means  of  the  reading  passage. 
They  were   then  requested   to   identify  the   speakers    for  both  sentence 
stimuli and the vowel samples.

Sentence  Stimuli 

Prior  to  administration  of  the  vowel  stimuli,   listeners  were 
presented  with  an   identification  task  consisting  of  sentence   stimuli. 
The   purpose   of   using  connected   speech  for  an  aural  speaker   identification 
task  was   to  determine  whether  the   naive   listeners   could,    in  fact, 
identify the speakers, at least in this easier task.

Table II shows the results of the listener identification judgements
on the sentence stimuli. Group I listeners, who knew the speakers, made
no errors in identifying the speakers, while the mean for Group II was
84 percent-correct identification. (It should be noted that the
listener who achieved only 44 percent-correct identification on this task
performed at a level comparable to the other listeners in Group II on the
vowel stimuli. The experimenter is unable to account for this listener's
performance here.)

Table III presents the results of the sentence stimuli for the individual
speakers and, in general, the results for all speakers were similar,
with correct identification levels ranging from 83 to 96 percent,
with the exception of speaker 5, whose mean level was only 58 percent;
this speaker, however, was not generally identified at a lower level than
other speakers in the main experiment. Of interest also, and not at all
unexpected, is that identification performance was markedly higher for
the pretest sentences than for the isolated vowel stimuli for both
listener groups.

Indeed,    the   results   for  Group  I  are    in  general  agreement  with  t> 
high    identification  performance  found  by  Pollack,   Pickett  and  Sumby 
(1954)   and  LaRiviere   (1971).      (The   use   of   a  training  session  may  exp 
why  there  were  no  errors   in  the   present  study.)     The   fact  that  both 
listener  groups   obtained  a  higher   identification  performance  for  thr 
sentence   stimuli  than  for  the  vowel  stimuli   is  further  evidence  that 
contained   in  connected  speech  are   important  for   identification  of 
speakers.     Cues   such  as  rate   of  speaking,    inflection,   formant  freqi."- 
transitions   and  dialect  may  be   among  the   factors   that  are  needed  fm    ^' 
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Table   II.     Percent -correct  speaker  identification  for  the 
sentence   stimuli  by  listeners 


Group 

II* 

Listener 

Pe  roe  nt -Cor re  ct 

Id 

antif ication 

1 

89 

2 

44 

3 

89 

4 

83 

5 

89 

6 

100 

7 

83 

8 

94 

Mean  84 


*Identif ication  was   100%  for  Group  I  for  all  listeners 


Table   III.     Percent-correct  speaker   identification  for  the 
sentence   stimuli  by  speakers 


Group  II* 

Speaker  Percent-Correct 

Identification 

1  83 

2  96 

3  88 

4  92 

5  58 

6  86 

Mean  84 


*Identif ication  was  100%  for  Group  I  for  all  speakers 
identification performance.

It    is   also   important   to  note   that   listeners  who  did  not  know  the 
speakers  were    able   to  correctly   identify  the   speakers   approximately  four 
times   out  of  five   and  they  were   able   to  perform  at  this    level  after 
hearing  each  speaker's   speech  for  only  two-and-one-quarter  minutes. 
Thus,    it   is  reasonable   to  suggest  that  not   only  are  the  cues  found   in 
connected   speech   important  for    identification  of   speakers,    but   that  these 
cues   are  rapidly  assimilated  by  naive   listeners. 

Vowel  Stimuli 

Preliminary  Analysis.     To  determine  whether   learning  occurred  during 
the   presentation  of  the  vowel  stimuli,   the   raw  data  for  each  group  were 
initially examined for trends. The number of errors (misidentifications)
for  each   listener  for  each  of  the   three  experimental  presentations  of 
vowel  stimuli   is    presented   in  Table    IV  together  with  the   total  number  of 
errors   and  mean  error  values  for  each  repetition  over  all  listeners  within 
a  group.     Since   these   data  do  not    indicate   any  trend  that  could   be    inter- 
preted  as   "learning,"  the   three  repetitions  were  pooled  and  entered   into 
the  factorial  design  for  analysis. 

Similarly, because widely divergent performance on the part of any given
listener could influence overall group performance unduly, listener
performance was examined within the two groups. These performance data are
presented in Table V and, while evidencing some variation in performance in
both groups, the data indicate that listeners within each group performed at
comparable levels. For listeners in Group I the overall performance ranged
from 41.3 to 52.4 percent-correct, while for listeners in the second group,
performance ranged from 14.6 to 25.7 percent-correct

Table V. Overall percent-correct speaker identification for the
vowel stimuli by listeners, with speakers, vowels and
fo levels pooled

          Group I                         Group II

Listener    Percent-Correct     Listener    Percent-Correct
            Identification                  Identification
   1             52.4              1             25.7
   2             51.4              2             20.8
   3             41.3              3             23.3
   4             47.6              4             21.9
   5             44.4              5             14.6
   6             51.4              6             18.4
   7             49.0              7             17.7
   8             45.5              8             17.0

 Mean            47.9            Mean            19.9

Note: Chance performance at the 5% level of confidence,
X̄ = 16.7 ± 4.3%

identification. Only one listener in Group II performed below a chance
level of 16.7 percent. The F distribution specifically allows for a
considerable violation of the assumption of homogeneity of variance when
dealing with an equal number of sample observations (Kirk, 1968, p. 61).
Since visual interpretation of listener performance levels did not indicate
heterogeneity of variance, statistical custom was observed and a specific
test for homogeneity was not performed (Kirk, 1968).
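
The ± 4.3% band in the note to Table V is not derived in the text. One
reading, offered here only as an assumption, is a normal-approximation
binomial interval around the one-in-six guessing rate, with each listener
contributing 288 responses (6 speakers x 4 vowels x 4 fo levels x 3
presentations). The sketch below reproduces the 16.7 ± 4.3 figure under that
assumption and picks out the Group II listeners who fell below chance; the
function name and the trial count are illustrative, not taken from the study.

```python
import math

N_SPEAKERS = 6  # six speakers, so a pure guess is correct one time in six
# Assumed trials per listener: speakers x vowels x fo levels x presentations.
TRIALS_PER_LISTENER = 6 * 4 * 4 * 3


def chance_band(n_choices, n_trials, z=1.96):
    """Return (chance level, half-width), both in percent.

    Uses the normal approximation to the binomial; z = 1.96 gives a
    two-sided band at the 5% level.
    """
    p = 1.0 / n_choices
    half_width = z * math.sqrt(p * (1.0 - p) / n_trials)
    return 100.0 * p, 100.0 * half_width


chance, half_width = chance_band(N_SPEAKERS, TRIALS_PER_LISTENER)

# Group II percent-correct scores by listener, copied from Table V.
group_ii = {1: 25.7, 2: 20.8, 3: 23.3, 4: 21.9,
            5: 14.6, 6: 18.4, 7: 17.7, 8: 17.0}
below_chance = [listener for listener, score in group_ii.items()
                if score < chance]

print(f"chance = {chance:.1f}% +/- {half_width:.1f}%")
print("Group II listeners below chance:", below_chance)
```

Under this reading only listener 5 scores below the chance level, which
agrees with the statement that a single Group II listener did so.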

Analysis  of  Variance.     Table   VI  presents   the  results  of  the   analysis 
of  variance   for  listener  groups,  vowels,   fundamental  frequencies   of 
vowels   and  speakers.      (Appendix  D  contains   the   actual  speaker   identifi- 
cation scores  on  which  this   analysis  was   performed.)     All  of  these  main 
effects  were   significant  at   the    .01  level  of  confidence.     Further,   all 
factor   interactions  were   significant    at   this   same    level  with  the 
exception  of   (1)   fundamental  frequencies  versus  vowels   and   (2)   groups 
versus  fundamental  frequencies   versus  vowels.     These   significant  factor 
interactions   indicate  that   listener  performance  for  any  single   factor  was 
not   independent   of  effect  from  other  factors.     Since   the  majority  of 
these   interactions  were   significant,    it   is    indicated  that   any  discussion 
of  the  main  effects   (groups,  vowels,   fundamental  frequencies  and 
speakers)   would  be   incomplete  without  reference   to  the    interactions.      In 
other  words,  discussion  of  the  main  effects   individually  would  yield 
little    information  because  each  factor  was  dependent    in  some    (as   yet 
undefined) manner upon another factor or factors. Consequently, a
posteriori comparisons (Kirk, 1968) were performed to examine these factor
interactions in detail as a necessary supplement to the analysis of
variance. Examination of the results of these statistical tests allows for
interpretation of the relationships among groups, vowels, fundamental
frequencies and speakers in the following presentation of these main
factors.

Listener  Groups 

From the analysis of variance summary table (Table VI) it can be seen that
the two groups were significantly different in their overall performance.
Group I performed significantly better than did Group II, as shown in
Table V. Group I obtained 48 percent-correct identification and Group II
obtained 20 percent-correct identification. The performance of the
listeners who knew the speakers compares well with that of the subjects of
LaRiviere (1971) who obtained 40 percent-correct identification. The
performance of the listeners who did not know the speakers was not
significantly above a chance level of 16.7 percent. This result indicates
that while the training was sufficient for high identification performance
for connected speech, this was not the case for isolated vowels.

As  noted  above,   the   factor,   groups,    interacted  significantly  with 
vowels,   fundamental  frequencies   and  speakers,   and  these   interactions   are 
considered  below.     Table  VII  provides   the   results   of  the  comparison  of 
the   two  listener  groups   in  reference   to  speakers.     When  the   interaction 
between  listener  groups   and  speakers  was  analyzed   (e.g.,   the  Group  I 
response  to  speaker  1  versus   the  Group  II  response   to  speaker   1),    it  was 
determined   that  every  speaker  was   correctly  identified  at  a  significantly 
higher level (α = .05) in listener Group I than in listener Group II.
This finding is consistent with that of Bricker and Pruzansky (1966)
whose  data  indicated  that   listeners  who  knew  the  speakers  had  a  higher 
performance   level  than  listeners  who  did  not  know  the  speakers. 

Figure  4  demonstrates   that   the  ranking  of  speakers  tended  to  be 

Table VII. A posteriori comparisons between the means for Group I versus
Group II by speakers, with vowels and fo levels pooled

                              Speaker
                  1      2      3      4      5      6

X̄(I) - X̄(II)     .71*   .36*   .71*   .81*  1.07*  2.39*

*significant at 5% level of confidence
(critical difference Q5 = .34)

different for the two groups; that is, the two listener groups differed in
their ability to recognize a particular speaker. In this figure, each value
represents the mean (in percent) of 384 listener observations. The
difference (percent-correct identification) required for significance
(α = .05) is 11.32 percent. Since the groups differed not only in their
overall performance but also in their ability to identify a particular
speaker, it would seem that the two groups utilized different parameters or
processes in making their decisions. It may also be noted from Figure 4 that
speakers whose SFFs were in the center of the SFF range (speakers 3 and 4)
were identified less well than were the other speakers for both groups.
(Speakers are numbered by their speaking fundamental frequency: speaker 1
had the highest SFF, while speaker 6 had the lowest SFF.)

Vowels

Table VIII presents the comparisons of vowels for each listener group. The
data revealed that for Group I, /æ/ > /i/, /æ/ > /u/ and /a/ > /i/,
/a/ > /u/ (significant at the five percent level of confidence), but there
was no significant difference between /æ/ and /a/ or between /i/ and /u/.
Further, there were no significant differences between vowels for Group II,
although /æ/ tended to yield higher identification scores than the other
three vowels.

Figure 5 shows the overall mean percent-correct identification for vowels
and listener groups and indicates clearly the result noted above. In this
figure each value represents the mean (in percent) of 576 observations, and
significance at the five percent level of confidence requires a difference
of 7.47 percent or greater. The main difference between the two groups was
in their response to the vowel /a/ since it yielded the highest performance
level for the first listener group but not for Group II. However, only in
Group I did listener performance result in significant vowel differences,
indicating that this group utilized cues found in /æ/ and /a/ more
effectively than cues present in /i/ and /u/, while Group II either did not
perceive these cues or did not utilize them. It should be noted that the
Group I performance level for all vowels was significantly higher than that
for Group II.
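
The observation counts quoted for Figures 4 and 5 (384 per speaker and 576
per vowel, within each listener group) appear to follow from the factorial
design. The sketch below is a simple check of that arithmetic, assuming that
every one of the eight listeners in a group heard every speaker, vowel and
fo combination on each of the three presentations; the constant and function
names are illustrative.

```python
import math

LISTENERS_PER_GROUP = 8
SPEAKERS = 6
VOWELS = 4
FO_LEVELS = 4
PRESENTATIONS = 3


def observations(pooled_factor_sizes):
    """Number of responses behind one plotted mean.

    Pass the sizes of every factor that is pooled over for that mean.
    """
    return math.prod(pooled_factor_sizes)


# One speaker within one group: pool listeners, vowels, fo levels, presentations.
per_speaker = observations(
    [LISTENERS_PER_GROUP, VOWELS, FO_LEVELS, PRESENTATIONS])

# One vowel within one group: pool listeners, speakers, fo levels, presentations.
per_vowel = observations(
    [LISTENERS_PER_GROUP, SPEAKERS, FO_LEVELS, PRESENTATIONS])

print("observations per speaker mean:", per_speaker)
print("observations per vowel mean:", per_vowel)
```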

To  summarize  with  respect  to  vowels,   the   two  types  of   listeners — 
those  who  knew  the  speakers   and  those  who  did  not  know  the   speakers — 
differed  significantly   in  their  overall  performance.     Specifically,  Group 
I performed at a higher identification level than did Group II for all
four of the vowels. Further, for Group I, /æ/ and /a/ yielded higher
identification  scores   than  did   /i/   and   /u/,   while   for  Group  II  there  were 
no  significant  differences  within  vowels. 

The   results  found  for  the  Group  I  listeners   are   similar  to  the 
findings   reported  by  LaRiviere    (1971)   whose   subjects   also  knew  the 
speakers.     However,   the  results   of   the   present  study  are   at  variance  with 
the  findings  of  Stevens  et  al.   (1968)   who   indicated  that  the  front  vowel 
/i/  yielded  slightly  higher  performance   levels   than  the   back  vowel  /a/. 
Indeed,  Stevens  et   al.   suggested  that   because   of   its  high  second  formant 
/i/   should  result   in  higher  identification  scores  since   glottal  source 
differences   between  speakers  are   probably  readily  perceived  at  these 
higher  frequencies.     The  results  of  the   present   study  as  well  as   those 
reported  by  LaRiviere  do  not   support  this   postulate,  but   argue    instead 
that cues found in the higher first formants of /æ/ and /a/ are better
perceived.     Whether  these  cues  represent   glottal  source  characteristics 
cannot  be   stated  at  this   time.     However,   the  vowels   in  the  Stevens  et  al. 
study were produced within context rather than in isolation and perhaps
are not directly comparable to the present vowel stimuli; the results may
be equally incomparable. In all probability their listeners' identifications
were based on additional cues including: (1) cues found in their vowel
stimuli that are not present in isolated vowels, such as formant frequency
transitions, (2) phonemes other than the vowels, i.e., the surrounding
phonemes, and (3) cues found only in connected speech, e.g., rate and
inflection.

Fundamental Frequencies

Table    IX  shows   the   comparisons   of  fundamental  frequency   levels  for 
each  of   the    listener  groups,   while   Figure   6   presents    their  performance    in 
terms   of   percent-correct   identification.     Significant  differences 
resulted only for Group I; that is, Group I listener performance was
significantly higher at the two lower fundamental frequency levels than at
the two higher levels: 110 Hz > either 139 or 123 Hz, and 98 Hz > either 139
or 123 Hz. Neither the two higher fo means nor the two lower fo means
significantly differed from each other. In other words, Group I listeners
appear to have performed a high-low dichotomization. For Group II listeners,
however, there were no significant differences between any of the fo means.
Group I listeners performed significantly better overall than Group II
listeners.

It will be remembered that the speaking fundamental frequency of the
speakers ranged from 119 to 98 Hz (Table I) and thus did not encompass the
four fundamental frequency levels used for the vowel stimuli. The two fo
levels (110 and 98 Hz) which resulted in a higher identification performance
for Group I were most similar to the actual SFFs of most of the speakers.
Therefore, it may be inferred that when the fundamental frequency produced
deviates from the actual SFF, there is a decrease in correct identification
of speakers. Or, to put it conversely, listeners (who know the speakers) are
more able to recognize the speakers if the fundamental frequency is close to
the speakers' SFFs.

To  summarize  with  respect  to  the  factors  discussed  above,    it   is 
evident  that  the   two  listener  groups  did  not   perform  equally  well  on 
identification  of  speakers  from  vowel  stimuli.     They  also  differed   in 
their  responses   according  to  the   vowel  and  the  fundamental  frequency 
level.     The    listeners  who  did  not  know  the   speakers  did  not   perform  above 
a chance level and there were no significant differences in their
performances between any of the vowels or the fundamental frequency levels.
(The lack of differentiation on these parameters may be attributed to the
low overall performance for this group.) It would appear that the amount of
training given these subjects was not sufficient for the identification of
speakers from isolated vowel stimuli. It is also apparent that
high  performance   on  sentence  stimuli   is   not  a  criterion  for  determining 
whether  sufficient   learning  has   occurred  for  relatively  accurate  speaker 
identification  from  isolated  vowel  stimuli. 

However, the listeners who knew the speakers did perform well above a
chance level and did respond differently according to the vowel and the
fundamental frequency level. Specifically, performance was higher for the
low vowels, /æ/ and /a/, than for the high vowels, /i/ and /u/. The two lower
fundamental frequency levels (110 and 98 Hz) resulted in higher speaker
identification performance than did the 139 and 123 Hz levels. Thus,
listeners who know the speakers are not able to identify the speakers
equally well under all conditions, i.e., they are misled by changes in both
the formant frequencies (change of vowel) and fundamental frequency. It is
important, therefore, to specify these parameters in any study that
involves this type of listener.

Speakers

From  the  analysis  of  variance  summary  table   (Table  VI)    it  may  be 
seen  that   there  was  a  significant   interaction  between  speakers  and  vowels 
and  between  speakers  and  fundamental  frequency  levels. 

Vowels.     Table  X  presents   comparisons  between  vowels   for  each 
speaker with listener groups and fundamental frequencies pooled.
Examination of the data revealed the same trend found within Group I, with
/æ/ and /a/ tending to yield better listener performance than /i/ and /u/.
Not all speakers demonstrated this trend in listener performance, but where
significant differences occurred, speaker identification performance was
generally higher for the low vowels than for the high vowels,
and   /a/  tended  to  yield  the  highest  performance   level  of  all  vowels. 

Fundamental Frequency. Table XI presents the a posteriori comparisons
between the fo levels for each of the speakers, with both groups and vowels
pooled. In these comparisons there were significant differences in
identification scores between the fo levels for four of the speakers:
speakers 1, 2, 3 and 6. For one of the speakers with a relatively high SFF
(speaker 2), the higher fo levels generally yielded significantly better
identification scores than the lower fo levels. The reverse was true for
speaker 3, where the lowest fo level yielded significantly better
performance than the two highest fo levels. For speaker 6 also the lower fo
levels yielded significantly higher identification scores, since significant
differences (five percent level of confidence) were found for five of the
six fo comparisons (the exception is 110 versus 98 Hz). In general these
results indicate that, where significant differences occurred between the fo
levels, the fo level yielding the higher speaker identification score was
somewhat dependent upon the speaker's SFF. In other words, there was a trend
for those speakers who had the higher SFFs to be identified better when the
fo level was higher, while the reverse was true for speaker 5. However,
these results did not occur for all speakers and did not occur for every fo
level for those speakers where significant differences were obtained. But
the data do suggest that there is a relationship between a speaker's SFF and
how well he was identified at any fo level. Examination of the SFFs and
speaker identification and SFFs and speaker confusions yields further
support for this relationship. These factors are discussed in the next
section.

Acoustic  Analysis 

Speaking  Fundamental  Frequency 

Deviations  from  the  Mean  SFF.     A  speaking  fundamental  frequency  was 
determined  for  each  speaker  based  on  his  reading  of  the   "Apology" 
passage.     From  these  data  an  SFF  mean  for  all  speakers  was  obtained.     The 
speakers  were  rank-ordered  according  to  their  absolute  deviations  from 
this  mean  SFF  with  the   speaker  who  had  the   largest  deviation  given  a  rank 
of   1.     This   procedure  was  carried  out   in  an  effort  to  determine  whether 
or  not   listeners   used  pitch  differences  of   the   speakers   as  a  means   of 
identifying  them.      If  SFF  was   utilized   in  this  manner,   speakers  with  the 
greatest  deviation  from  the  mean  would  be  most  easily   identified.     An 
equal-interval,    linear  scale  was   chosen  to  describe   the  SFF  values   since 
it  was   judged  that  this  representation  would  more   closely  relate   to  the 
perceptual  parameter  of   pitch.     Thus,   the   semitone  values   of  the  SFFs 
were  used  for  the  correlational  analysis. 
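
The conversion and ranking described above can be sketched as follows. This
is a minimal illustration, not the procedure actually used: the reference
frequency (16.35 Hz) is an assumed convention, and the six SFF values are
hypothetical numbers chosen only to span the reported 119 to 98 Hz range;
they are not the Table I measurements. The choice of reference does not
affect the deviations, since it cancels when the mean is subtracted.

```python
import math

REFERENCE_HZ = 16.35  # assumed reference; the source does not state one


def hz_to_semitones(freq_hz, reference_hz=REFERENCE_HZ):
    """Convert a frequency in Hz to semitones above the reference."""
    return 12.0 * math.log2(freq_hz / reference_hz)


def rank_by_deviation(sff_hz):
    """Rank speakers by absolute deviation from the mean SFF in semitones.

    Rank 1 goes to the speaker with the largest deviation, i.e. the one
    predicted to be most easily identified if pitch is the cue.
    """
    semitones = {spk: hz_to_semitones(f) for spk, f in sff_hz.items()}
    mean_st = sum(semitones.values()) / len(semitones)
    ordered = sorted(semitones,
                     key=lambda spk: abs(semitones[spk] - mean_st),
                     reverse=True)
    return {spk: rank for rank, spk in enumerate(ordered, start=1)}


# Hypothetical SFFs in Hz (speaker 1 highest, speaker 6 lowest); NOT Table I.
hypothetical_sff = {1: 119.0, 2: 115.0, 3: 110.0, 4: 107.0, 5: 103.0, 6: 98.0}
predicted_rank = rank_by_deviation(hypothetical_sff)

for speaker in sorted(predicted_rank, key=predicted_rank.get):
    print(f"rank {predicted_rank[speaker]}: speaker {speaker}")
```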


Table  XII  presents  the   speakers'  SFFs   in  both  Hertz  and  semitones, 
as  well  as  deviations  from  the  mean  and  the  predicted  rankings,   and 
provides   the   actual  ranking  of   speakers  for  each  listener  group.     For  the 
latter  rankings  the   speaker  who  was   correctly  identified  the   greatest 
number  of  times  was  given  a  rank  of  1.     For  Group  II  the   correlation 
between the two rankings was significant at τ = .87 (p = .008), while for
Group I a non-significant correlation of τ = .33 (p = .281) was obtained.

These results are given in graphic form in Figure 7. For Group II
the  only  reversal  (predicted  versus  observed  ranking)  was  for  speakers   1 
and  2,  whereas  for  Group  I  only  speakers   3  and  4  retained  the  same 
ranking.     The  high  correlation  for  Group  II  between  deviations  from  the 
mean  SFF  and   identification  performance   suggests   that   in  this   group, 
listeners   utilized  pitch  differences  between  speakers    in  making  their 
judgements.     Although  the  naive  group  did  not  perform  above  a  chance 
level,    it  would  appear  that   those  speakers  who  were   best   identified  were 
those  whose  pitches  were  most  distinctive   to  the   listeners   (i.e.,   those 
who  were  most  divergent  from  the  mean  SFF). 
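
The Group II coefficient can be checked against the single reversal noted
above. Assuming the reported statistic is Kendall's rank correlation, two
rankings of six speakers that differ by one exchange of adjacent ranks agree
on 14 of the 15 speaker pairs, giving (14 - 1) / 15, or about .87. The sketch
below computes this for placeholder rankings; the labels A to F and both
rankings are hypothetical and are not the Table XII values.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings without ties, given as {item: rank}."""
    concordant = 0
    discordant = 0
    for x, y in combinations(sorted(rank_a), 2):
        product = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical predicted ranking (by deviation from the mean SFF).
predicted = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}

# Hypothetical observed ranking: identical except that A and B trade places.
observed = {"A": 2, "B": 1, "C": 3, "D": 4, "E": 5, "F": 6}

tau = kendall_tau(predicted, observed)
print(f"tau = {tau:.2f}")
```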

SFF and Confusion Errors. Correlations between a speaker's ranking for SFF
(in semitones) and the overall confusion errors, and between SFF ranking and
confusion errors by individual fo levels, were generally not significant.
These results indicate that speaker confusions were not consistent within
either parameter. While the statistical analysis established that speaker 1,
for example, was not significantly misidentified as speaker 4 over all
conditions, the analysis did not preclude the existence of patterns of
misidentification.

Examination  of  the   overall  confusion  percentages  provided   in  Table 
XIII  revealed  that   indeed  there  was  a  trend  for  confusions   to  relate  to 

Table XIII. Overall speaker confusion percentages for Group I
and Group II, with vowels and fo levels pooled

Group I
                            Actual Speaker
Perceived Speaker     1     2     3     4     5     6
        1             -    65    33    21    17    16
        2            27     -    23    14    14    12
        3            33    16     -    17    11    18
        4            20     8    14     -    31    29
        5            12    11    16    21     -    25
        6             9     0     9    27    26     -

Group II
                            Actual Speaker
Perceived Speaker     1     2     3     4     5     6
        1             -    32    22    18    24    13
        2            32     -    29    18    17    21
        3            19    17     -    24    21    23
        4            17    24    21     -    25    20
        5            18    14    12    16     -    22
        6            14    13    16    22    14     -

SFF similarities. The data suggest that a speaker generally tended to be
confused with speakers whose SFFs were close to his.

Where Table XIII allows inspection of actual versus perceived speaker,
Table XIV provides percent-confusion data for the individual fo levels for
each listener group. At fo = 139 Hz, for example, in 25 percent of the
misidentifications speaker 1 was the perceived speaker, in 26 percent of the
errors speaker 2 was perceived, etc. (Since percentages were rounded to
whole numbers a column will only approximately total 100 percent.) It is
apparent in these data that, generally, the speakers with the higher SFFs
were mistakenly perceived when errors occurred at the higher fo levels.
Similarly, speakers with the lower SFFs generally were mistakenly identified
at the lower fo levels. Speaker 6, for example, who had the lowest SFF, was
seldom the perceived speaker at the 139 Hz level. However, as the stimulus
frequency (fo) was presented at progressively lower levels, he was
mistakenly perceived a greater percentage of the time.
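
The trend described above can be made concrete by pooling the Table XIV
percentages for the highest-SFF and lowest-SFF speakers at each fo level.
The sketch below does this for both listener groups. Taking speakers 1 and 2
as the "higher SFF" set and speakers 5 and 6 as the "lower SFF" set is an
illustrative choice, not a grouping defined in the study, and the values are
as read from Table XIV.

```python
FO_LEVELS_HZ = [139, 123, 110, 98]

# Perceived speaker -> percent of all errors at each fo level (Table XIV).
GROUP_I = {
    1: [25, 32, 33, 22],
    2: [26, 16, 9, 8],
    3: [8, 19, 17, 18],
    4: [19, 17, 11, 13],
    5: [12, 11, 14, 19],
    6: [10, 5, 16, 21],
}
GROUP_II = {
    1: [25, 19, 16, 12],
    2: [30, 21, 19, 11],
    3: [9, 20, 19, 21],
    4: [19, 18, 20, 16],
    5: [11, 13, 13, 17],
    6: [6, 9, 15, 23],
}

HIGH_SFF = (1, 2)  # illustrative grouping: the two highest-SFF speakers
LOW_SFF = (5, 6)   # illustrative grouping: the two lowest-SFF speakers


def pooled_share(table, speakers):
    """Sum the error percentages of the given speakers at each fo level."""
    return [sum(table[s][i] for s in speakers)
            for i in range(len(FO_LEVELS_HZ))]


for name, table in (("Group I", GROUP_I), ("Group II", GROUP_II)):
    high = pooled_share(table, HIGH_SFF)
    low = pooled_share(table, LOW_SFF)
    for fo, h, l in zip(FO_LEVELS_HZ, high, low):
        print(f"{name}, fo = {fo} Hz: high-SFF {h}%, low-SFF {l}%")
```

For both groups the share of errors attributed to speakers 1 and 2 falls
steadily as the stimulus fo is lowered, which is the pattern the text
describes.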

In  light  of  the  trends   noted  above,    it  was  considered  particularly 
important   to  examine   speaker  confusions    in  greater  detail.     Tables  XV  and 
XVI present speaker confusion percentages for both listener groups at each
fo level and show both actual and perceived speakers with the percentage of
confusion between them. In Table XV at fo = 139 Hz, for example, of the
total number of times that speaker 1 was misidentified, he was identified 40
percent of the time as speaker 2, 9 percent of the time as speaker 3, 31
percent as speaker 4, etc. As noted above, since percentage values have been
rounded to whole numbers, any column will total only approximately 100
percent. It is apparent in these data that both trends — misidentification
as a speaker whose SFF is similar or as a

Table XIV. Mistakenly-perceived speakers in percentages for each fo
level for Group I and Group II, with vowels pooled

                         Group I                  Group II
                         fo (Hz)                  fo (Hz)
Perceived Speaker   139  123  110   98       139  123  110   98
        1            25   32   33   22        25   19   16   12
        2            26   16    9    8        30   21   19   11
        3             8   19   17   18         9   20   19   21
        4            19   17   11   13        19   18   20   16
        5            12   11   14   19        11   13   13   17
        6            10    5   16   21         6    9   15   23

Note: Speakers are listed in order of descending SFF

speaker whose SFF tended toward the stimulus fo level — are present. A
graphical presentation, however, demonstrates these patterns more clearly.

Data from Table XIII are presented graphically in Figure 8, which shows
speaker confusions for Group I and Group II with all vowels and all fo
levels pooled. However, only those identification errors which account for
at least 25 percent of the total number of errors have been included.
Indeed, selection of 25 percent as the cut-off value was based on the
assumption that values at or above this level would tend to better indicate
major trends in the listeners' performance while lower values could instead
reflect random occurrences. As will be detailed below, this assumption
appears to have been correct. In Figure 8 the direction of the error is
indicated by the arrow; in Figure 8a, for example, when speaker 1 was
misidentified by Group I, he was misidentified as speaker 2 for 27 percent
of the time and as speaker 3 for 33 percent of the time.

From  the  diagram  it  is  evident  that  the  speakers  were  divided  into 
two subsets by Group I (Figure 8a): the three speakers with the highest
SFFs comprised one subset and the speakers with the lowest SFFs formed
another subset. Further, it is clear, as it was for the Bricker and
Pruzansky (1966) data in Figure 1, that confusion errors occur within a
subset  and  not  merely  within  pairs  of  speakers.  Generally,  speakers  with 
the  highest  SFFs  were  confused  with  each  other  while  the  three  speakers 
with  the  lowest  SFFs  were  most  often  confused  with  each  other  by 
listeners  in  Group  I.  In  Figure  8b,  which  presents  Group  II  confusions, 
there  appears  to  be  no  persistent  trend.  Indeed,  there  are  only  four 
confusions  that  are  25  percent  or  greater.  However,  the  confusions  which 
represent  less  than  25  percent  also  show  that  overall  confusion  errors 

[Figure 8. Group I and Group II speaker confusion percentages of 25% or
greater, with vowels and fo levels pooled. Panels: a. Group I; b. Group II.]

were  not  concentrated   in  subsets   as  was  found  for  Group  I.     Generally, 
then, Group II did not consistently confuse particular speakers but,
since  the  overall  Group  II  performance   on  the  vowel  stimuli  did  not  rise 
above   a  chance   level,  this   occurrence   is  hardly  surprising.     In  view  of 
the   partitioning  noted  for  Group  I,  however,  examination  of  the 
confusions   by  individual  fo  levels   is   necessary  toward   identifying 
specific   identification  trends. 
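
The 25 percent cut-off and the subset pattern described above can be checked
directly from the Table XIII percentages. The sketch below lists every
confusion at or above the cut-off for each group and tests whether each one
stays inside the high-SFF subset (speakers 1 to 3) or the low-SFF subset
(speakers 4 to 6). The data structure and function names are illustrative,
and the values are as read from Table XIII.

```python
CUTOFF = 25  # percent of a speaker's total misidentifications

# Actual speaker -> {perceived speaker: percent of that speaker's errors}.
GROUP_I = {
    1: {2: 27, 3: 33, 4: 20, 5: 12, 6: 9},
    2: {1: 65, 3: 16, 4: 8, 5: 11, 6: 0},
    3: {1: 33, 2: 23, 4: 14, 5: 16, 6: 9},
    4: {1: 21, 2: 14, 3: 17, 5: 21, 6: 27},
    5: {1: 17, 2: 14, 3: 11, 4: 31, 6: 26},
    6: {1: 16, 2: 12, 3: 18, 4: 29, 5: 25},
}
GROUP_II = {
    1: {2: 32, 3: 19, 4: 17, 5: 18, 6: 14},
    2: {1: 32, 3: 17, 4: 24, 5: 14, 6: 13},
    3: {1: 22, 2: 29, 4: 21, 5: 12, 6: 16},
    4: {1: 18, 2: 18, 3: 24, 5: 16, 6: 22},
    5: {1: 24, 2: 17, 3: 21, 4: 25, 6: 14},
    6: {1: 13, 2: 21, 3: 23, 4: 20, 5: 22},
}


def major_confusions(table, cutoff=CUTOFF):
    """Return (actual, perceived, percent) triples at or above the cut-off."""
    return [(actual, perceived, pct)
            for actual, row in sorted(table.items())
            for perceived, pct in sorted(row.items())
            if pct >= cutoff]


def same_subset(actual, perceived):
    """True when both speakers are in the high (1-3) or the low (4-6) subset."""
    return (actual <= 3) == (perceived <= 3)


major_i = major_confusions(GROUP_I)
major_ii = major_confusions(GROUP_II)

for label, major in (("Group I", major_i), ("Group II", major_ii)):
    print(f"{label}: {len(major)} confusions at or above {CUTOFF}%")
    for actual, perceived, pct in major:
        where = "within" if same_subset(actual, perceived) else "across"
        print(f"  speaker {actual} heard as {perceived}: {pct}% ({where} subset)")
```

With these values Group II shows four confusions at or above the cut-off, as
stated in the text, and every Group I confusion that passes the cut-off
falls within one of the two SFF subsets.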

Figure 9 presents speaker confusions for Group I listeners, indicating the
percentage of confusions (25 percent or greater) and the direction of the
confusions. For example, in Figure 9a, speaker 1 was identified as (or,
confused with) other speakers a certain number of times. In 31 percent of
these misidentifications, he was confused with speaker 4, and in 40 percent
of the misidentifications, he was confused with speaker 3. The remaining 29
percent of the speaker confusions (for speaker 1) were directed toward at
least two other speakers and are not plotted since, individually, no other
confusion attained the 25 percent cut-off value.

In Figure 9a, for fo = 139 Hz, a high-low SFF dichotomization is still
somewhat in evidence, as noted for the overall confusions. The perceived
speaker was generally one who had a similar but higher SFF than the actual
speaker, with the exception of speaker 1, of course. In Figure 9b, for
fo = 123 Hz, the three speakers with the highest SFFs appear to comprise a
subset, and speaker confusions at this fo level generally tended toward a
speaker within this subset. For fo = 110 Hz, Figure 9c, Group I listeners
again performed a high-low SFF dichotomization: confusions occurred between
speakers whose SFFs were adjacent and within the high or low SFF subset.
More reciprocal confusions (i.e.,

[Figure 9. Group I speaker confusion percentages of 25% or greater by fo
levels, with vowels pooled. Panels: a. fo = 139 Hz; b. fo = 123 Hz;
c. fo = 110 Hz; d. fo = 98 Hz.]

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between  pairs   of  speakers)   appeared  at  this  f^  level  than  had  previously 
been  noted.     Figure   9d  shows  the  major  confusions  for  fo  =  98  Hz,   and  the 
high-lov?  SFF  dichotoraization  is  again  in  evidence,  with  the  exception  of 
speaker  3, 

To summarize with respect to Group I, it would appear that the group of
listeners who knew the speakers performed a dichotomization of speakers
into a high SFF and a low SFF group. These listeners tended generally
not only to confuse speakers who were close to each other in SFF but,
further, tended to make the majority of their confusions within an SFF
high or low subset. Overall, the fo level appeared to have little
influence on confusions for Group I listeners.

A  corollary  analysis  was   performed  for  Group  II,  even  though  their 
overall  vowel  performance  was   at  a  chance   level,   and  certain  trends  were 
detected.     This   confusion  analysis    is   presented   in  Figure   10.     For  fo  = 
139  Hz,   Figure   10a,  speakers   1  and   2  comprise   a  subset  with  which  the 
other  speakers  were  often  confused.     These   two  speakers,   as  noted,  had 
the  highest  SFFs.      In  Figure   10b,   fo  =  123  Hz,   Group  II   listeners 
responded  similarly  to  Group  I   listeners:     the  three  speakers  with  the 
highest  SFFs  were  often  the   perceived  speakers.      In  Figure   10c,   the   110 
Hz  fo  level,   speaker  confusions   generally  tended  to  have   been  between 
speakers with similar speaking fundamental frequencies. For the 98 Hz
fo level, Figure 10d, there is a slight high-low SFF dichotomization
for Group II listeners, with speakers 3 and 6 most often mistakenly
perceived.

To summarize in regard to Group II, these listeners' judgements
appeared to be influenced to a considerable degree by the fo level, as
described below. The possible exception is the 110 Hz level (Figure
10c), which appears to have been considered rather neutral. If,
however, the fo levels are viewed sequentially from 139 to 98 Hz, the
tendency of these listeners' confusions readily becomes apparent:
judgements tended toward the speakers with the higher SFFs at the
higher fo levels and toward the speakers with the lower SFFs at the low
fo levels. Additionally, at the 110 and 98 Hz fo levels, which are
closest in frequency to the speakers' actual SFFs, there is indication
of another judgement criterion being applied: listeners tended to
confuse speakers who were actually close together in SFF.

[Figure 10. Group II speaker confusion percentages of 25% or greater by
fo levels, with vowels pooled. Panels: a. fo = 139 Hz; b. fo = 123 Hz;
c. fo = 110 Hz; d. fo = 98 Hz.]

Thus,   both  listener  groups   appear  to  have   utilized  pitch   in  making 
their   identification  judgements.      It  would  seem,    then,   that  Group  I 
listeners  were  relatively  stable   in  their  decision  behavior   in  that  this 
group  tended  to  divide   the   six  speakers   into  two  groups  of  three  each: 
three  speakers  with  high  pitches   and  three  with   low  pitches.     Further, 
this group was little influenced by the changing fo levels. The second
group,  however,   composed  of   listeners  who  did  not  know  the  speakers, 
perceived  speakers  whose  SFFs  were  highest  when  the  fo  level  was  high  and 
tended  toward  confusions  with  speakers  with  lower  SFFs   at  the  two  lower 
fo  levels.     For  the  two   lower  fo  levels,   also,   the   perceived  speaker 
often  was   one  having  an  SFF  similar  to  that   of  the  actual  speaker. 

Therefore it is indicated that identification of speakers is dependent
at least in part upon a speaker's normal SFF and its relationship to
the other speakers' SFFs. It appears almost as if listeners generate,
or possess, an internal matrix for pitch along which speakers are
positioned according to their relationship with other speakers. However
the pitch relationship is maintained (no statement could be other than
speculative at this point), once listeners have established this
relationship, they appear to use it as one criterion for making
identification judgements. The two groups responded differently to the
cue of pitch, however, with listeners in Group I apparently able to
maintain a more stable concept of each speaker's pitch. Considering
that Group II performed at a chance level, however, it may alternately
be assumed that listeners in Group II did not achieve a stable pitch
relationship between speakers.

Regardless of the dissimilar degree of performance between the listener
groups, the tendency toward similarity in kind of performance is
striking and, overall, these results seem to indicate that SFF is
related to speaker confusions for both listener groups: (1) a speaker
was confused with another speaker whose SFF was similar to his or (2) a
speaker was confused with a speaker whose SFF was similar to the fo
level produced. Although the relationship among two speakers' SFFs, the
fo level and confusions does not account for all the confusions, it
indicates that both SFF and fo are related to speaker identification.
It can be suggested that the use of specified fo levels had somewhat of
a confounding effect upon the listeners and that the effect is
different for the two types of listeners. For Group I the relationship
between SFF and speaker confusions is relatively stable, i.e., not
influenced by fo level. However, the relationship between SFF and
speaker identification is dependent upon the fo level produced: the
closer the fo level to the actual SFFs, the better the identification.
Thus, while Group I obtained fewer correct speaker identifications at
the higher fo levels, these listeners remained relatively stable in
their types of confusions. The opposite was true for Group II: these
listeners were stable in their overall identification performance by fo
level, but were less consistent for speaker confusions.

SFF and Speaker Identification Judgements, Correct and Incorrect.
In view of the apparent tendency for listeners' identification
judgements (of a speaker) to vary directly with the fundamental
frequency, it was decided to statistically analyze the data with
reference to the extreme SFFs and fo levels utilizing a chi-square
statistic (Kirk, 1968). Speakers were therefore pooled into three
groups: the two speakers with the highest SFFs, the two lowest and the
two in the center of the SFF range. A mean was taken of the
identification judgements for each of the speaker pairs and this value
was utilized for the analyses. Since any trend relating to fo and SFF
would be most evident at the extremes of the fo range, data from only
two levels were used for the analyses: the highest and the lowest fo
levels. Separate analyses were performed for correct speaker
identification judgements and for mistakenly-perceived speakers, and
the responses of the two listener groups were analyzed separately. The
trend noted in the preceding section for a speaker with a high SFF to
be perceived at a high fo level, or low SFF at a low fo level, could
thus be subjected to a rigid test.

The  results  of  these  analyses  are  given  in  Tables  XVII  and  XVIII. 
Where  a  significant  chi-square  value  occurred,  the  data  were  further 
analyzed  by  treating  two  speaker  pairs  at  a  time  utilizing  separate  chi- 
square  analyses.  These  results  are  also  given  in  Tables  XVII  and  XVIII. 
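As a rough illustration of this procedure, the sketch below computes a chi-square statistic for a table of three speaker pairs by two fo levels and then repeats the test for two pairs at a time. The counts are invented for the example and are not the values in Tables XVII and XVIII. The .05 critical values (5.991 for 2 degrees of freedom, 3.841 for 1) are an assumed significance level, since the level used is not restated here.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical judgement counts (illustrative only).
# Rows: speaker pairs (1,2), (3,4), (5,6); columns: 139 Hz, 98 Hz.
table = [[20, 10], [15, 15], [10, 20]]
overall_stat, overall_df = chi_square(table)

CRITICAL_05 = {1: 3.841, 2: 5.991}  # assumed .05 level
if overall_stat > CRITICAL_05[overall_df]:
    # Follow-up: treat two speaker pairs at a time, as the text describes.
    pair_stat, pair_df = chi_square([table[0], table[2]])
    print(round(overall_stat, 3), round(pair_stat, 3), pair_df)
```

The follow-up call compares only the highest-SFF and lowest-SFF pairs; the other two pairings would be tested the same way.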

Group I correct identification judgements (Table XVII) for speaker pair
1,2 were significantly different from judgements for other speakers.
For Group II, however, while correct identification of speaker pair 1,2
differed from identification of speaker pair 5,6 at both fo levels,
speaker pair 3,4 did not differ from either of the other pairs. For the
mistakenly-perceived identification judgements (Table XVIII) for Group
I, an insignificant chi-square was obtained, indicating that the
speakers were not perceived more or less well according to the fo
level. For Group II, perception of speaker pair 1,2 was significantly
different from that of speaker pair 5,6, but speaker pair 3,4 was not
differentiated from the other speaker pairs. A discussion of these
results is given below.

For Group I the four speakers with the extreme SFFs were correctly
identified to a greater or lesser degree depending on whether or not
the SFFs were similar to the fo level. For example, speaker pair 1,2
(with the highest SFFs) was more correctly identified at the 139 Hz
level than at the 98 Hz level. While the data indicate that speaker
pair 3,4 was better identified at the lowest fo level, it may be seen
in Table XVII that this trend was much larger for speaker pair 5,6.
Thus, at the lowest fo level speakers 5 and 6 were more often correctly
identified than speakers 3 and 4, who in turn were more often correctly
identified than speakers 1 and 2. From the analyses of the
mistakenly-perceived speakers it may be seen that the differences were
not statistically significant for the acquainted listeners (Table
XVIII). This result supports the trend noted above, that this group was
more stable in its confusions than was the naive group. However, there
was a slight trend for the speakers to be more often perceived
incorrectly when the fo level was similar to the SFFs, i.e., the trend
noted immediately above.

Group II appears to have been influenced by the relationship between
the SFFs and the fo level for both correct identification and
mistakenly-perceived judgements. When the fo level was high, the
speakers with the highest SFFs were most often perceived (both
correctly and incorrectly) and vice versa. Identification judgements of
speakers 3 and 4 occurred between the two extremes, although they
tended to be more often perceived at the lowest fo level.

It is interesting to note that for both listener groups there was a
tendency for speakers in the two pairs 1,2 and 5,6 to be perceived
approximately two times more often when the fo level and the SFFs were
similar. Although the groups differed in absolute value, this
approximately two-to-one ratio was noted for both listener groups for
both the correct identification and the mistakenly-perceived data. As a
single example of this tendency, it may be noted that speaker pair 5,6
(with the lowest SFFs) was perceived approximately twice as often at
the lowest fo level (98 Hz) as at the highest fo level (139 Hz).

These analyses, then, tend to validate statistically the trends noted
above, derived from analysis of speaker confusions, and further support
the suggestion that a listener's speaker identification judgement is
influenced by the perceived and remembered pitch of each speaker. When
all speakers produced the same fo levels, as in the present study, the
listeners appear to have based their judgements, at least in part, on
the closeness of the remembered SFFs to the fo level.

Formant Frequencies

The first two formant frequencies were measured from a spectrographic
representation of each vowel production. (It must be remembered,
however, that spectrograms do not allow for extremely fine
measurement.) The speakers' formant frequencies within a vowel were
generally similar, i.e., no speaker's formant frequencies were
consistently deviant from those of the other speakers. It was also
evident from preliminary analysis that the formant frequencies did not
vary systematically with fundamental frequency, e.g., the formant
frequencies did not consistently increase with an increase in
fundamental frequency. Consequently, mean formant frequencies were
derived for Formant One and Formant Two for each speaker and for each
vowel with fo levels pooled. These mean values are provided in Appendix
E. Rank-order correlations were performed between speakers' formant
frequency ranking and speaker confusions for each listener group, for
both F1 and F2 (Appendix F). The correlations were generally
insignificant, with significant correlations noted randomly and,
apparently, indicative only of chance occurrence. Further, no trend was
apparent within either listener group for formant frequency to offer
predictive validity in regard to speaker confusions.
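For readers unfamiliar with rank-order correlation, a minimal sketch follows. It assumes Spearman's coefficient, computed as the Pearson correlation of average ranks; the text does not name the exact coefficient used. The F2 means and confusion counts are hypothetical, as is the choice of "times mistakenly perceived" as the confusion measure.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank-order correlation (Pearson correlation of the ranks).

    Assumes neither variable is constant; otherwise the denominator is zero.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for six speakers (illustrative only).
mean_f2_hz = [1100, 1180, 1240, 1300, 1350, 1420]
times_mistakenly_perceived = [12, 9, 15, 7, 11, 5]
rho = spearman_rho(mean_f2_hz, times_mistakenly_perceived)
print(round(rho, 3))
```

With only six speakers, a coefficient must be large in absolute value to reach significance, which is consistent with the mostly insignificant correlations reported.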

This finding is counter to that of LaRiviere (1971), who reported that
F2 appeared to be a good predictor for speaker confusions. It should be
noted, however, that the correlations found by LaRiviere between F2 and
speaker confusions failed to reach significance at the ten percent
level of confidence. Further, LaRiviere's speakers were free to choose
the generated fo level for the voiced vowel stimuli and may be presumed
to have selected an fo level comparable to their normal SFF. In the
present study, on the other hand, each speaker was forced to generate
prescribed fo levels, with a maximum of one fo level (of four)
approaching the speaker's own SFF. It appears that whatever usefulness
the second formant frequency might have had for predicting speaker
confusions for productions at or about a speaker's SFF was lost in the
process of generating other fo levels.


CHAPTER  IV 

SUMMARY  AND   CONCLUSIONS 

The present study was conducted in the area of aural speaker
identification. The purpose of this study was to investigate particular
parameters relating to an aural speaker identification task and in
particular (1) to determine if listeners who knew the speakers and
listeners who did not know the speakers (a) performed equally well and
(b) responded similarly; (2) to determine whether listener performance
was the same for all vowels; (3) to establish the relationship between
speaker identification performance and controlled fundamental
frequencies of vowels; (4) to establish whether or not a relationship
existed between either SFF and speaker identification, or SFF and
speaker confusions, or both; and (5) to determine whether formant
frequencies were related to speaker confusions when one formant
frequency was considered at a time.

The subjects for this study consisted of (a) six male speakers who had
SFFs within the desired range and who could produce specified fo levels
and (b) sixteen listeners, eight of whom knew the speakers and eight
who did not know the speakers.

After all of the listeners were given training in identifying the
speakers, they were presented with a short passage and requested to
circle, on their answer sheet, the name of the speaker they perceived
for each sample. After the training session, the sentence stimuli were
presented to the listeners. Then the four vowels, /i/, /æ/, /a/ and
/u/, at each of four fundamental frequencies, 139, 123, 110 and 98 Hz,
were presented to both listener groups.

An analysis of variance was performed on the listeners' responses in
order to determine the relationships between listener groups, vowels,
fundamental frequencies and speakers. Appropriate a posteriori
comparisons were performed where significant interactions occurred in
the analysis of variance.

Additionally, acoustic analysis was performed on all vowel stimuli and
the first two formant frequencies were obtained. These parameters were
utilized to predict speaker confusions and were correlated with actual
confusions. All correlations were performed with pooled fo levels;
thus, each formant frequency was actually a mean formant frequency.

Further,   speaker  deviations  from  the  mean  SFF  were  correlated  with 
mean  percent-correct   identifications,   and  ranking  of  speakers   by  SFF  was 
correlated  with  speaker  confusions. 

The primary results of this study may be summarized as follows:

1. The sentence stimuli resulted in the highest speaker identification
performance of all the stimuli for both groups of listeners.

2. On the vowel stimuli, the listeners who knew the speakers performed
significantly above chance, while the listeners who did not know the
speakers did not perform above chance. Further, Group I performed
significantly better than Group II.

3. Only for Group I listeners was there a significant vowel difference,
with the low vowels /æ/ and /a/ yielding better speaker identification
performance than the high vowels /i/ and /u/.

4. Significant speaker identification differences between the
fundamental frequency levels resulted only for Group I; speakers proved
to be more correctly identified at 98 and 110 Hz than at 123 and 139
Hz.

5. (a). Speaker deviations from the mean SFF were highly correlated
with overall correct identification only for Group II; the more deviant
a speaker's SFF was from the mean SFF, the better he was identified.

5. (b). In general the correlations between SFF and speaker confusions
were insignificant and no trends were revealed in the data. However,
analysis of speaker confusions in terms of SFF and the fo level
revealed two trends: a speaker tended to be confused with speakers
whose SFFs were (a) similar to the SFF of the actual speaker or (b)
close to the fo level being produced. Group I tended to respond as
noted in (a) while Group II responded as (a) for the lower two fo
levels and as (b) for the higher fo levels. Group I listeners were
generally consistent in their confusions, performing a simple
dichotomization of speakers into either a high or low SFF subset. On
the other hand, Group II listeners were apparently influenced by the fo
level.

6. For both listener groups when the speakers were paired according to
their SFFs, the two speaker pairs with the highest and the lowest SFFs
were perceived approximately twice as frequently when the fo level was
most similar to their SFFs as when it was most different from their
SFFs. This result occurred for both the correct identification
judgements and for the mistakenly-perceived judgements.

7. There was no significant correlation, overall, between the mean
values for either of the two formant frequencies (F1 and F2) and
speaker confusions.

The following minor results were noted in regard to speaker
differences:

a. The listener groups responded to speakers differently, i.e., they
differed, as groups, in their ability to recognize a particular
speaker.

b. Where vowel differences occurred for speakers, the low vowels
generally resulted in higher speaker identification performance than
the high vowels. These differences did not occur for all speakers,
however.

c. While some significant differences were found between listener
performance in identifying particular speakers at different fo levels,
there was no overall trend. In other words, no one fo was found to
yield best performance for all speakers.

The  specific  results   noted   in  summary  above,   taken  singly  and   in 
combination,   lead  to  several  conclusions.     All  conclusions   are,   of 
course,  derived  from  the   conditions   utilized   in  the  present  study  and  are 
bound  by  the  scope   of  these   conditions.     Further,   because   of  the   small 
number  of   listeners    in  each  subject  group,   generalization  from  these 
conclusions   should  be  made   only  cautiously.     The  major  conclusions    are 
presented  below. 

The two listener groups differed both quantitatively and qualitatively
in their performance and must be assumed to represent two listener
types which cannot be utilized within one subject group.

Those   cues   pertinent  to  speaker   identification  which  are   present   in 
connected  speech  (a)   are  quickly  assimilated  to  a  useful  degree   by  naive 
listeners  and   (b)   are  present   in  lesser  number  or  to  a  lesser  degree   in 
isolated  vowels.     Since   the  presence   of  such  cues   leads   to  high  speaker 
identification  performance  for  both  listener  types,   it  may  be   concluded 
further that listeners (a) more readily perceive, (b) are more adept in
utilizing, or (c) both, these connected-speech cues than those present
in isolated vowels.

Cues   pertinent  to  speaker  identification  persist   in   isolated  vowels 
in  sufficient  number  or  degree  that  acquainted   listeners   can  make 
adequate  discriminations   among  speakers  from  brief   isolated  vowel 
samples.     The   lower  performance  level  achieved  by  the  naive   listeners 
suggests  that  these  cues   are   (a)   difficult  to  perceive,   (b)   not  readily 
learned  or   (c)   both. 

Some cues pertinent to differentiation among speakers are (a)
present  only   in  low  vowels,    (b)   present  to  a  greater  degree   in  low  vowels 
or  (c)  more  readily  perceived   in  low  vowels  than   in  high  vowels.     Since 
low  vowels   are  known  to  have  higher  first  formants   than  high  vowels,   one 
is    led  to  conclude  that   some   characteristic(s)   of  the  first  formant   is 
(are)   important   to  the    identification  of  speakers. 

It is necessary to point out that listener performance on sentence
stimuli cannot be taken as a measure of listener familiarity with the
speaker's voice. Carrying the observation further, it must be concluded
that high identification performance on sentence stimuli is not a
predictor for performance on isolated vowel stimuli, at least for naive
listeners.

Some characteristic(s) of a speaker's voice is (are) distorted when he
produces an fo different from his SFF. While the results obtained in
the present study do not establish that the change in fo per se is the
distorting factor, there is sufficient evidence to indicate that some
concomitant to fo plays a major role in producing this distortion. It
must be recalled that the vowel stimuli provided fundamental frequency
information only at specified levels, with all speakers producing
essentially identical fos, so that discrimination among speakers could
not be performed on the basis of frequency information alone. Therefore
it must be concluded that SFF is related to fo for both speaker
identification and speaker confusion in a complex manner, and that the
relationship entails some factor(s) beyond a simple delineation of SFF
in terms of a certain number of Hertz.

While a specific conclusion is not warranted by the limited evidence
revealed  by  this   study,    it  must  be   noted  that  naive   listeners   apparently 
were  not  consistent   in  their  judgements.     Although  this  finding  may 
reflect  the  naive   listeners'   unfamiliarity  with  the   speakers'  voices   and 
be,   consequently,  merely  evidence   of  confusion  on  the  part   of  the 
listeners,   there    is  some    indication  that   these   listeners  did   indeed  vary 
their  judgement  criteria. 

The  conclusions  and  observations  noted  above  have   provided  a  certain 
amount  of  additional  information  and  evidence  regarding  the  parameters 
pertinent  to  speaker   identification.     The   present  research,  however,  has 
not  provided  definitive   answers   to  many  questions   and,   in  fact,  has 
raised  additional  questions.     Perhaps   the  most   important  and  immediate   of 
these    is  whether  the  differences    in  listener  performance  are  essentially 
differences    in  degree  and  whether  there  may  be   a  training  procedure  such 
that these performance differences may be eliminated or, at least,
minimized. The author suggests the following training procedure to
address this question. Listeners would be provided with rigorous
training following a "step-wise" procedure where they are initially
trained on stimuli in the context of sentences, to an optimum
performance level. They would then be trained on stimuli in the
following order: (1) phrases, (2) plurisyllabic words, (3) monosyllabic
words, (4) nonsense monosyllables and (5) isolated phonemes. It is
possible that after progressive training of this nature naive listener
performance would more closely approximate that of listeners who know
the speakers.

Certainly  other  approaches  to  the   problem  could  be  formulated  and 
the  author  recognizes   that  the  present  research  points   to  many  other 
specifics  requiring  extensive   investigation  in  the  future.     However, 
these  specifics  extend  beyond  the  scope   of  the   present  study  which,    in 
summary, draws the following conclusions:

1. The two listener groups utilized in the present study represent
different  types   of  listeners. 

2.  Based  on  the   present  study,    it  may  be  stated  that  there   is  no 
one  parameter  which  is  of  primary  importance   to  the   aural   identification 
of  speakers   for  vowel  samples,    i.e.,   none   of  the   parameters  studied 
(vowels,  fundamental  frequency,  SFF  or  formant  frequencies)   proved  to  be 
of singular importance.

3.  The   overall  conclusion,   therefore,   is   that  several  parameters 
(including  those  not  examined,  such  as  source   characteristics)   comprise   a 
"Gestalt"  and  that  the   percept  a  listener  obtains   and  maintains  for  a 
particular speaker is of a summative nature with no one factor obtaining
primacy. 


APPENDIX  A 
PASSAGE   READ  BY  SPEAKERS 


Adapted from: "An Apology for Idlers"
by  Robert  Louis  Stevenson 


Extreme busyness, whether at school or college, church or market, is a
symptom of deficient vitality. A faculty for idleness implies a
catholic appetite and a strong sense of personal identity. There is a
sort of dead-alive, hackneyed people about, who are scarcely conscious
of living except in the exercise of some conventional occupation. Bring
these fellows into the country, or set them on board ship, and you will
see how they pine for their desk or their study. They have no
curiosity; they cannot give themselves over to random provocations nor
do they take pleasure in the exercise of their faculties for its own
sake. Unless necessity lays about them with a stick, they will even
stand still. It is no good speaking to such folk. They cannot be idle;
their nature is not generous enough. They pass those hours, which are
not dedicated to furious toiling in the gold-mill, in a sort of coma.
When they do not require to go to the office, when they are not hungry
or have no mind to drink, the whole breathing world is a blank to them.


APPENDIX B
INSTRUCTIONS TO LISTENERS


INSTRUCTIONS 

This   is   an  experiment   in  speaker  identification.     The  speakers  are 
those  pictured  on  the  wall   in  front  of  you.     Your  first  task  is  merely 
to  listen  to  the  speakers  read  a  passage  to  you.     They  will  each  state 
their  first  names   before  reading  the  passage.     The   first  time   through, 
the  speakers  will  be   in  the  order  of  the  pictures   (left  to  right).     The 
next  two  times   the   order  will  be  different.     After  you  have  heard  the 
passages,  you  will  be  required  to  identify  the  speakers  as   they  read  a 
short  passage.     Later,  you  will  be   asked  to   identify  the   speakers  from 
isolated  vowels. 

Please   indicate  your  answer  by  circling  the  name   on  the   answer 
sheet  of  the  speaker  you  hear.     Be  sure  to  answer  all  items.     If  you 
are not sure, guess. The pictures are in the same order as the names on
the answer sheet.

There  will  be   a  five-second   interval  between  each  stimulus.     After 
every  tenth  vowel  sample  there  will  be  a  short  tone.     If  you  hear  the 
tone  when  you  have  not   just  completed  a  stimulus  that   is  a  multiple  of 
ten,   please  notify  the  experimenter. 

Please  note  that  a  speaker  might  not  sound  exactly  the  same  for 
every  sample  of  a  particular  vowel. 

Do  you  have  any  questions? 

(Note: The last three paragraphs were given the listener after
completion of the training tape and the control test tape.)


APPENDIX C
EXAMPLE OF LISTENER RESPONSE FORM


Tape No.                    Name

[Response form: each item lists the six speakers' first names (Layne,
Bob, Paul, John, Wayne, Eric); the listener circles the name of the
speaker heard.]


APPENDIX D

CORRECT IDENTIFICATION SCORES FOR VOWEL STIMULI FOR EACH
SPEAKER, VOWEL AND fo LEVEL FOR GROUP I AND GROUP II


Table XIX. Correct speaker identification scores for the vowel stimuli
at the 139 Hz level for Group I, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1          10     15     19      7       51
   2          16     23     20     20       79
   3           6      8      9      1       24
   4          10      5      5     10       30
   5           4     12     17      3       36
   6           4      3     12      6       25

Total         50     66     82     47      245


Note: Perfect score for each cell = 24
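
The marginal totals of a score table of this kind can be checked mechanically. The following Python sketch re-enters the Table XIX cell scores and recomputes the row, column, and grand totals. The variable names are illustrative only and are not part of the original study.

```python
# Cell scores from Table XIX (Group I, 139 Hz), one row per speaker.
# Column order: /i/, /ae/, /a/, /u/.
scores = {
    1: [10, 15, 19, 7],
    2: [16, 23, 20, 20],
    3: [6, 8, 9, 1],
    4: [10, 5, 5, 10],
    5: [4, 12, 17, 3],
    6: [4, 3, 12, 6],
}

# Row totals: one per speaker, summed over the four vowels.
row_totals = {speaker: sum(cells) for speaker, cells in scores.items()}

# Column totals: one per vowel, summed over the six speakers.
col_totals = [sum(cells[k] for cells in scores.values()) for k in range(4)]

# The grand total must agree whether it is summed by rows or by columns.
grand_total = sum(row_totals.values())
assert grand_total == sum(col_totals)

print(row_totals)
print(col_totals, grand_total)
```

The same check can be applied to any of the tables in this appendix by replacing the cell values.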


Table XX. Correct speaker identification scores for the vowel stimuli
at the 139 Hz level for Group II, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           8      6      9      5       28
   2           8     17     11     12       48
   3           0      2      2      4        8
   4           5      6      2      6       19
   5           5      7      5      2       19
   6           1      2      2      2        7

Total         27     40     31     31      129


Note: Perfect score for each cell = 24


Table XXI. Correct speaker identification scores for the vowel stimuli
at the 123 Hz level for Group I, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1          10      7     17      4       38
   2           5     10     11     10       36
   3           7     10     10      6       33
   4           8      7      6     17       38
   5           9     13     22      9       53
   6          22     16     11     11       60

Total         61     63     77     57      258


Note: Perfect score for each cell = 24


Table XXII. Correct speaker identification scores for the vowel stimuli
at the 123 Hz level for Group II, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           7      4      4      5       20
   2           3     12      5      5       25
   3           0      2      2      2        6
   4           3      1      0      4        8
   5           2      4      5      8       19
   6           8      2      3      3       16

Total         23     25     19     27       94


Note: Perfect score for each cell = 24


Table XXIII. Correct speaker identification scores for the vowel stimuli
at the 110 Hz level for Group I, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           9     17     20      7       53
   2           3      7      3     13       26
   3           8      7     14      2       31
   4          13      7     12     14       46
   5          13     19     16     12       60
   6          20     22     22     24       88

Total         66     79     87     72      304


Note: Perfect score for each cell = 24


Table XXIV. Correct speaker identification scores for the vowel stimuli
at the 110 Hz level for Group II, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           4      8      8      4       24
   2           6      9      5      9       29
   3           3      1      5      3       12
   4           1      5      1      1        8
   5           5      4      4      3       16
   6          10      5      3      6       24

Total         29     32     26     26      113


Note: Perfect score for each cell = 24


Table XXV. Correct speaker identification scores for the vowel stimuli
at the 98 Hz level for Group I, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           8     18      6      3       35
   2           4      8      6     12       30
   3           4     15     18     11       48
   4          11      7     13      5       36
   5          14     20     14     10       58
   6          19     24     23     23       89

Total         60     92     80     64      296


Note: Perfect score for each cell = 24


Table XXVI. Correct speaker identification scores for the vowel stimuli
at the 98 Hz level for Group II, listed by speakers


                    Vowels
Speaker      /i/    /æ/    /ɑ/    /u/    Total

   1           2      6      5      2       15
   2           7      9      5      2       23
   3           0      7      7      6       20
   4           3      2      2      5       12
   5           2      6      2      6       16
   6          10     11     11      5       37

Total         24     41     32     26      123


Note: Perfect score for each cell = 24
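
For comparison across fundamental frequency levels, the grand totals of Tables XIX through XXVI can be expressed as percentages of the maximum possible score. The sketch below assumes that each table contains 24 cells (six speakers by four vowels) with a perfect score of 24 per cell. It also assumes that chance performance is one in six, because there are six response alternatives. The chance figure is an assumption made for illustration and is not a value reported in this appendix.

```python
# Grand totals copied from Tables XIX-XXVI, keyed by fundamental frequency (Hz).
grand_totals = {
    "Group I": {139: 245, 123: 258, 110: 304, 98: 296},
    "Group II": {139: 129, 123: 94, 110: 113, 98: 123},
}

N_SPEAKERS = 6
N_VOWELS = 4
PERFECT_CELL = 24  # from the note under each table
MAX_SCORE = N_SPEAKERS * N_VOWELS * PERFECT_CELL

# Assumption: six equally likely response alternatives.
CHANCE_PERCENT = 100.0 / N_SPEAKERS


def percent_correct(total):
    """Express a grand total as a percentage of the maximum possible score."""
    return 100.0 * total / MAX_SCORE


percentages = {
    group: {hz: round(percent_correct(total), 1) for hz, total in by_level.items()}
    for group, by_level in grand_totals.items()
}

for group, by_level in percentages.items():
    for hz, pct in by_level.items():
        print(f"{group}, {hz} Hz: {pct}% (chance about {CHANCE_PERCENT:.1f}%)")
```

Percentages of this kind are descriptive only; they do not replace the significance tests reported in the body of the study.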


APPENDIX E

VOWEL FORMANT FREQUENCY MEANS FOR FORMANT
ONE AND FORMANT TWO FOR EACH SPEAKER


APPENDIX F

FORMANT  FREQUENCY  RANK-ORDER  CORRELATIONS   BETWEEN 
EXPECTED  AND  ACTUAL  SPEAKER  CONFUSIONS 


Table XXVIII. Formant One and Formant Two rank-order correlations for
Group I between actual confusions among speakers and
expected confusions among speakers for the vowel stimuli


                           Speaker
Vowel             1      2      3      4      5      6

/i/     F1      .47    .33    .60   -.14   -.07    .07
        p*      .19    .35    .09    .70    .85    .85
        F2      .07    .20    .20   -.23    .60   -.60
        p       .85    .57    .57    .43    .09    .09

/æ/     F1      .55    .50    .28    .69    .28   -.55
        p       .12    .16    .44    .05**  .44    .12
        F2      .23    .07    .41    .14    .55    .33
        p       .44    .84    .24    .70    .12    .55

/ɑ/     F1      .60    .47    .83   -.20   -.60    .20
        p       .09    .19    .02**  .58    .09    .57
        F2     -.07   -.20    .14   -.60   -.26    .60
        p       .85    .57    .70    .09    .47    .09

/u/     F1      .41   -.41    .55   -.20   -.14    .28
        p       .24    .24    .12    .57    .70    .44
        F2      .55    .41   -.14   -.07   -.41    .28
        p       .12    .24    .70    .85    .24    .44

*p = probability of occurrence
**significant at 5% level of confidence
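
The rank-order statistic is not named in this appendix. As an illustration only, the following Python sketch assumes Kendall's tau computed over six ranked items, with a two-tailed probability from the normal approximation to the distribution of S. That assumption is consistent with coefficient and probability pairs in the table such as .60 with .09, but it is an inference and not a statement of the procedure actually used. The ranks in the example are hypothetical and are not data from this study.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Return (tau, S) for two equal-length rankings without ties (tau-a)."""
    if len(x) != len(y):
        raise ValueError("rankings must have the same length")
    s = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair.
        s += (product > 0) - (product < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return s / n_pairs, s


def two_tailed_p(s, n):
    """Normal approximation for S under the null (no ties, no continuity correction)."""
    variance = n * (n - 1) * (2 * n + 5) / 18
    z = abs(s) / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2))


# Hypothetical ranks, invented for illustration only.
expected = [1, 2, 3, 4, 5, 6]  # e.g. ranks based on formant-frequency distance
actual = [2, 1, 3, 6, 4, 5]    # e.g. ranks based on observed confusions

tau, s = kendall_tau(expected, actual)
p = two_tailed_p(s, len(expected))
print(round(tau, 2), round(p, 2))
```

With these hypothetical ranks, twelve of the fifteen pairs are concordant and three are discordant, so S = 9 and tau = .60, and the approximation gives a probability of about .09.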


Table XXIX. Formant One and Formant Two rank-order correlations for
Group II between actual confusions among speakers and
expected confusions among speakers for the vowel stimuli


                           Speaker
Vowel             1      2      3      4      5      6

/i/     F1     -.14    .20   -.07    .20    .41    .47
        p*      .70    .57    .85    .57    .24    .19
        F2      .33   -.20   -.47    .14   -.14   -.28
        p       .35    .57    .18    .70    .70    .44

/æ/     F1      .60    .60    .73    .47    .60   -.41
        p       .09    .09    .04**  .19    .09    .24
        F2      .20    .20    .33   -.20    .20    .14
        p       .57    .57    .35    .57    .57    .70

/ɑ/     F1      .69    .73    .55    .47   -.14    .30
        p       .05**  .04**  .12    .19    .70    .40
        F2      .41    .33   -.14    .60    .33    .60
        p       .24    .35    .70    .09    .35    .09

/u/     F1      .50   -.14   -.41    .20   -.33   -.14
        p       .18    .70    .24    .57    .35    .70
        F2      .22    .14   -.28   -.47   -.20   -.28
        p       .55    .70    .44    .18    .57    .44

*p = probability of occurrence
**significant at 5% level of confidence